
TRAINING FEED FORWARD NEURAL NETWORKS

DR. SANJAY CHATTERJI


The Fast-Food Problem
 During training, we show the neural network a large number of
training examples and iteratively modify the weights to minimize the
error we make on those examples.
 One idea is to be intelligent about picking our training cases, but in
practice we rarely get to choose our data.
 Instead, we try to motivate a solution that works well in general.
Gradient Descent
 We minimize the squared error over all of the training examples,
simplifying the problem to build intuition.
 A linear neuron with two inputs yields a three-dimensional error surface,
where the horizontal dimensions correspond to the weights w1 and w2
and the vertical dimension corresponds to the value of the error function E.
 We can also conveniently visualize this surface as a set of elliptical
contours.
 Contours correspond to settings of w1 and w2 that evaluate to the
same value of E.
 From this picture we can develop a high-level strategy for finding the
values of the weights that minimize the error function.
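As a concrete sketch of the surface being described (the target/output notation t⁽ⁱ⁾, y⁽ⁱ⁾ is assumed here, not defined on the slides):

$$E(w_1, w_2) = \frac{1}{2}\sum_i \left(t^{(i)} - y^{(i)}\right)^2, \qquad y^{(i)} = w_1 x_1^{(i)} + w_2 x_2^{(i)}$$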
The Delta Rule and Learning Rates
 Hyperparameters: In addition to the weight parameters
defined in our neural network, learning algorithms also
require a couple of additional parameters to carry out the
training process.
 Learning rate: at each step of moving perpendicular to the contour,
we need to determine how far we want to walk before recalculating
our new direction. This distance depends on the steepness of the
surface.
 If we pick a learning rate that’s too small, we risk taking too long
during the training process. But if we pick a learning rate that’s
too big, we’ll most likely start diverging away from the minimum.
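The update this describes can be sketched as follows (using ε for the learning rate, an assumed symbol):

$$w_k \leftarrow w_k - \epsilon \frac{\partial E}{\partial w_k}$$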
Continued..
 Now, we are finally ready to derive the delta rule for
training our linear neuron.
 To calculate how to change each weight, we evaluate
the gradient, which is essentially the partial derivative of
the error function with respect to each of the weights.
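For the linear neuron with squared error, evaluating this gradient yields the delta rule; a sketch under the same assumed notation (x_k⁽ⁱ⁾ the k-th input of example i, t⁽ⁱ⁾ the target, y⁽ⁱ⁾ the output):

$$\Delta w_k = \epsilon \sum_i x_k^{(i)} \left(t^{(i)} - y^{(i)}\right)$$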
Gradient Descent with Sigmoidal Neurons
 Now, we will deal with training neurons and neural
networks that utilize nonlinearities.
 We use the sigmoidal neuron as a model.
 For simplicity, we assume that the neurons do not use a
bias term.
 The neuron computes the weighted sum of its inputs, the
logit z.
 It then feeds its logit into the logistic (sigmoid) function to
compute y, its final output.
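Concretely, with the bias omitted as assumed above:

$$z = \sum_k w_k x_k, \qquad y = \frac{1}{1 + e^{-z}}$$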


Gradient Descent with Sigmoidal Neurons
 For learning, we want to compute the gradient of the error
function with respect to the weights.
 To do so, we start by taking the derivative of the logit with
respect to the inputs and the weights, as sketched below.
 The resulting update rule is just like the delta rule, except with
extra multiplicative terms.
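A sketch of the chain of derivatives involved (same assumed notation as before):

$$\frac{\partial z}{\partial w_k} = x_k, \qquad \frac{\partial z}{\partial x_k} = w_k, \qquad \frac{dy}{dz} = y(1 - y)$$

Chaining these with the derivative of the squared error gives the update below, where the factor y(1 − y) is the "extra multiplicative term":

$$\Delta w_k = \epsilon \sum_i x_k^{(i)}\, y^{(i)}\left(1 - y^{(i)}\right)\left(t^{(i)} - y^{(i)}\right)$$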
The Backpropagation Algorithm
 How to tackle the problem of training multilayer neural
networks (instead of just single neurons)?
 We don’t know what the hidden units ought to be doing.
 We can compute how fast the error changes as we
change a hidden activity.
 From there, we can figure out how fast the error changes
when we change the weight of an individual connection.
 Essentially, we’ll be trying to find the path of steepest
descent in high-dimensional space!
The Backpropagation Algorithm Cont.
 Our strategy will be one of dynamic programming.
 Given the error derivatives (ED) for the activities of one layer of
hidden units, we compute the ED for the activities of the layer
below, and from those the ED for the weights leading into each unit.
 ED: Error Derivative
 We can express ED(layer i) in terms of ED(layer j), the layer above it.
 Once we fill up the table with all the partial derivatives, we
can determine how the error changes w.r.t. the weights.
 This tells us how to modify the weights after each
training example.
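For sigmoidal units this recurrence can be sketched as follows (assumed notation: unit j sits in the layer above unit i, with output y_j and connecting weight w_ij):

$$\frac{\partial E}{\partial y_i} = \sum_j w_{ij}\, y_j\left(1 - y_j\right) \frac{\partial E}{\partial y_j}, \qquad \frac{\partial E}{\partial w_{ij}} = y_i\, y_j\left(1 - y_j\right) \frac{\partial E}{\partial y_j}$$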
Stochastic and Minibatch Gradient Descent
 Here we’ve been using a version of gradient descent known as batch
gradient descent.
 Another potential approach is stochastic gradient descent, where at each
iteration the error surface is estimated with respect to only a single example.
 Instead of a single static error surface, here our error surface is dynamic.
 In mini-batch gradient descent, at every iteration, we compute the error
surface with respect to some subset of the total dataset.
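A minimal NumPy sketch of mini-batch gradient descent for the linear neuron (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def minibatch_gd(X, t, epochs=100, batch_size=10, lr=0.01, seed=0):
    """Train a linear neuron y = X @ w with squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)          # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, tb = X[idx], t[idx]
            y = Xb @ w                      # predictions on the mini-batch
            grad = -Xb.T @ (tb - y)         # dE/dw for E = 0.5 * sum((t - y)^2)
            w -= lr * grad                  # gradient step on the mini-batch
    return w
```

Setting batch_size=1 recovers stochastic gradient descent, and batch_size=len(X) recovers batch gradient descent.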
Test Sets, Validation Sets, and Overfitting
 A model that is too complex does not generalize well: this is overfitting.
 It is very misleading to evaluate a model using the data
we used to train it.
 At the end of each epoch, we want to measure how well
our model is generalizing.
As the number of connections and depth increases, the propensity to overfit also increases
 [Figure] ANNs with two inputs, a softmax output of size two, and a
hidden layer with 3, 6, or 20 neurons.
 [Figure] ANNs with one, two, or four hidden layers of three neurons each.
[Figure] The workflow we use when building and training deep learning models.
Preventing Overfitting in Deep Neural Networks
 One method of combating overfitting is called regularization.
 Regularization modifies the objective function that we minimize by
adding additional terms that penalize large weights.
 We change the objective function so that it becomes Error + λf(θ), where λ is
the regularization strength.
 L2 regularization: we add ½λw² to the error function for every weight w.
 L1 regularization: we add the term λ|w| for every weight w.
 The value of λ determines how much we want to protect against overfitting.
 λ = 0 ⇒ we do not take any measures against the possibility of overfitting.
 λ too large ⇒ our model will prioritize keeping θ as small as possible over
trying to find the parameter values that perform well on our training set.
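A small sketch of how these penalties modify the loss (NumPy; names are illustrative):

```python
import numpy as np

def regularized_loss(error, weights, lam=0.1, kind="l2"):
    """error: unregularized loss; weights: flat array of model weights."""
    if kind == "l2":
        penalty = 0.5 * lam * np.sum(weights ** 2)   # 0.5 * lambda * w^2 per weight
    else:
        penalty = lam * np.sum(np.abs(weights))      # lambda * |w| per weight
    return error + penalty
```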
[Figure] ANN trained using mini-batch gradient descent (batch size 10)
with L1 regularization strengths of 0.01, 0.1, and 1.
L1 vs. L2 Regularization
 The L1 regularization leads the weight vectors to become sparse
during optimization (close to zero). So, neurons with L1
regularization end up using only a small subset of their most
important inputs and become quite resistant to noise.
 In comparison, weight vectors from L2 regularization
are usually diffuse, small numbers.
 L1 regularization is very useful when you want to understand
exactly which features are contributing to a decision.
 If this level of feature analysis isn’t necessary, we prefer to use L2
regularization because it empirically performs better.
Max norm constraints
 Max norm constraints also aim to prevent θ from becoming too
large, but they do so more directly.
 They enforce an absolute upper bound c on the magnitude of the
incoming weight vector for every neuron and use projected gradient
descent to enforce the constraint.
 Typical values of c are 3 and 4.
 One of the nice properties is that the parameter vector cannot grow
out of control (even if learning rates are too high) because the
updates to the weights are bounded.
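A sketch of the projection step (NumPy; it assumes each neuron's incoming weight vector is stored as a column of W, a layout chosen here for illustration):

```python
import numpy as np

def project_max_norm(W, c=3.0):
    """After each gradient update, clip each neuron's incoming
    weight vector so its L2 norm is at most c."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)       # norm per column/neuron
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))  # shrink only oversized columns
    return W * scale
```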
Dropout
 Dropout is implemented by only keeping a neuron active with some
probability p.
 It prevents the network from becoming too dependent on any one (or any
small combination) of neurons.
 Mathematically, it prevents overfitting by providing a way of approximately
combining exponentially many different neural network architectures
efficiently.
 Inverted dropout: neuron whose activation hasn’t been silenced has its
output divided by p before the value is propagated to the next layer.
Intricacies
 Dropout is pretty intuitive to understand, but there are some important
intricacies to consider.
 First, we’d like the outputs of neurons during test time to be equivalent
to their expected outputs at training time.
 We could fix this naively by scaling the output at test time.
 For example, if p = 0.5, neurons must halve their outputs at test time
in order to have the same (expected) output they would have during
training.
 This means that if a neuron’s output prior to dropout was x, then after
dropout the expected output would be E[output] = p · x + (1 − p) · 0 = p · x.
Intricacies
 This naive implementation of dropout is undesirable, however,
because it requires scaling of neuron outputs at test time.
 Test-time performance is extremely critical to model evaluation, so
it’s always preferable to use inverted dropout, where the scaling
occurs at training time instead of at test time.
 In inverted dropout, any neuron whose activation hasn’t been
silenced has its output divided by p before the value is propagated
to the next layer.
 With this fix, E[output] = p · (x/p) + (1 − p) · 0 = x, and we can
avoid arbitrarily scaling neuronal output at test time.
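A minimal NumPy sketch of inverted dropout at training time (illustrative names; a real framework layer would differ in detail):

```python
import numpy as np

def inverted_dropout(activations, p=0.5, train=True, seed=None):
    """Keep each neuron active with probability p; scale at training time."""
    if not train:
        return activations                     # no-op at test time
    rng = np.random.default_rng(seed)
    mask = rng.random(activations.shape) < p   # 1 with probability p
    return activations * mask / p              # divide by p so E[output] = x
```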
Thank You
