TRAINING FEED FORWARD NEURAL NETWORKS
DR. SANJAY CHATTERJI
The Fast-Food Problem
During training, we show the neural net a large number of training examples and iteratively modify the weights to minimize the errors we make on those examples. One idea is to be intelligent about picking our training cases; instead, we try to motivate a solution that works well in general.

Gradient Descent
We minimize the squared error over all of the training examples by first simplifying the problem to a linear neuron with two inputs. The error then lives in a three-dimensional space where the horizontal dimensions correspond to the weights w1 and w2, and the vertical dimension corresponds to the value of the error function E. We can also conveniently visualize this surface as a set of elliptical contours, where each contour corresponds to settings of w1 and w2 that evaluate to the same value of E. From this picture we can develop a high-level strategy for finding the values of the weights that minimize the error function: repeatedly step in the direction perpendicular to the contours, i.e., the direction of steepest descent.

The Delta Rule and Learning Rates
Hyperparameters: in addition to the weight parameters defined in our neural network, learning algorithms also require a couple of additional parameters to carry out the training process. One of these is the learning rate: at each step of moving perpendicular to the contour, we need to determine how far we want to walk before recalculating our new direction. This distance should depend on the steepness of the surface. If we pick a learning rate that is too small, we risk taking too long during the training process; but if we pick a learning rate that is too big, we will most likely start diverging away from the minimum.

Continued..
Now we are finally ready to derive the delta rule for training our linear neuron. To calculate how to change each weight, we evaluate the gradient, which is essentially the partial derivative of the error function with respect to each of the weights. For the squared error E = 1/2 Σ_i (t^(i) − y^(i))^2, this yields the delta rule update Δw_k = ε Σ_i x_k^(i) (t^(i) − y^(i)), where ε is the learning rate, and t^(i) and y^(i) are the target and the neuron's output on the ith training example.
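To make the delta rule concrete, here is a minimal NumPy sketch of gradient descent on a linear neuron with two inputs; the synthetic dataset, learning rate, and step count are illustrative assumptions, not from the original slides.

```python
# A minimal sketch of the delta rule for a linear neuron with two inputs,
# minimizing the squared error E = 1/2 * sum_i (t_i - y_i)^2 by gradient
# descent. Dataset and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))   # 100 training examples, inputs x1 and x2
true_w = np.array([1.5, -2.0])
t = X @ true_w                  # targets generated by a known linear rule

w = np.zeros(2)                 # weights w1 and w2 to be learned
epsilon = 0.1                   # the learning rate

for step in range(100):
    y = X @ w                   # linear neuron: y = w1*x1 + w2*x2
    # dE/dw_k = -sum_i x_k^(i) (t^(i) - y^(i)), so the delta rule step is
    # delta w_k = epsilon * sum_i x_k^(i) (t^(i) - y^(i)) (averaged here).
    w += epsilon * X.T @ (t - y) / len(X)

print(w)   # approaches [1.5, -2.0], the bottom of the quadratic bowl
```

Each update walks a short distance along the direction of steepest descent on the quadratic error surface, exactly the contour-perpendicular strategy described above.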
Gradient Descent with Sigmoidal Neurons
Now we will deal with training neurons and neural networks that utilize nonlinearities, using the sigmoidal neuron as our model. For simplicity, we assume that the neurons do not use a bias term. The neuron computes the weighted sum of its inputs, the logit z = Σ_k w_k x_k. It then feeds its logit into the logistic function to compute y, its final output: y = 1 / (1 + e^(−z)).
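A small sketch of this forward pass; the weight and input values are arbitrary assumptions.

```python
# Forward pass of a sigmoidal neuron with no bias term.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.4])   # illustrative weights
x = np.array([1.0, 2.0])    # illustrative inputs

z = w @ x                   # the logit: weighted sum of the inputs
y = sigmoid(z)              # final output: y = 1 / (1 + e^(-z))
print(z, y)
```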
Gradient Descent with Sigmoidal Neurons Cont.
For learning, we want to compute the gradient of the error function with respect to the weights. To do so, we start by taking the derivative of the logit with respect to the inputs and the weights: ∂z/∂x_k = w_k and ∂z/∂w_k = x_k. Applying the chain rule through the logistic function, whose derivative is dy/dz = y(1 − y), the update rule is just like the delta rule, except with extra multiplicative terms: Δw_k = ε Σ_i x_k^(i) y^(i) (1 − y^(i)) (t^(i) − y^(i)).

The Backpropagation Algorithm
How do we tackle the problem of training multilayer neural networks (instead of just single neurons)? We don't know what the hidden units ought to be doing, but we can compute how fast the error changes as we change a hidden activity. From there, we can figure out how fast the error changes when we change the weight of an individual connection. Essentially, we will be trying to find the path of steepest descent in a high-dimensional space!

The Backpropagation Algorithm Cont.
Our strategy will be one of dynamic programming, where ED denotes an error derivative. From ED(one layer of hidden units) we compute ED(activities of the layer below), and then we compute ED(weights leading into the units); in other words, we can express ED(layer_i) in terms of ED(layer_j) for the layer above it. Once we fill up the table with all of the partial derivatives, we can determine how the error changes with respect to the weights. This tells us how to modify the weights after each training example.

Stochastic and Minibatch Gradient Descent
So far we have been using a version of gradient descent known as batch gradient descent, which computes the error surface over the entire dataset. Another potential approach is stochastic gradient descent, where at each iteration the error surface is estimated with respect to only a single example; instead of a single static error surface, our error surface is now dynamic. In minibatch gradient descent, at every iteration we compute the error surface with respect to some subset of the total dataset, balancing the two extremes.
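The following sketch ties the last few slides together: manual backpropagation through a two-layer network of sigmoidal neurons, trained with minibatch gradient descent on the squared error. The network shape, toy dataset, and hyperparameters are all assumptions for illustration.

```python
# A minimal sketch of backpropagation with minibatch gradient descent for a
# two-layer sigmoidal network (no bias terms, squared error).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                        # 200 examples, 2 inputs
t = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]   # toy targets

W1 = rng.normal(scale=0.5, size=(2, 3))   # input -> 3 hidden units
W2 = rng.normal(scale=0.5, size=(3, 1))   # hidden -> 1 output unit
epsilon, batch_size = 0.5, 10

for epoch in range(100):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]        # one minibatch
        x, target = X[idx], t[idx]

        # Forward pass: logits, then sigmoidal outputs, layer by layer.
        h = sigmoid(x @ W1)
        y = sigmoid(h @ W2)

        # Backward pass (dynamic programming): the error derivatives of one
        # layer's activities give the derivatives of the layer below.
        delta_out = (y - target) * y * (1 - y)        # ED at the output
        delta_hid = (delta_out @ W2.T) * h * (1 - h)  # ED at the hidden layer

        # Gradient step on the weights leading into each unit.
        W2 -= epsilon * h.T @ delta_out / len(idx)
        W1 -= epsilon * x.T @ delta_hid / len(idx)
```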
Test Sets, Validation Sets, and Overfitting
A model that is too complex does not generalize well: this is overfitting. Because of it, evaluating a model on the data we used to train it is very misleading. Instead, at the end of each epoch, we want to measure how well our model is generalizing to data it has never seen. As the number of connections and the depth of the network increase, the propensity to overfit also increases.
(Figure: ANNs with two inputs, a softmax output of size two, and a hidden layer with 3, 6, or 20 neurons.)
(Figure: ANNs with one, two, or four hidden layers of three neurons each.)
(Figure: the workflow we use when building and training deep learning models.)
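A minimal sketch of this epoch-level bookkeeping: hold out a validation set and measure error on it after every epoch. The linear model and synthetic data are assumptions; they illustrate the workflow rather than overfitting itself.

```python
# Track training vs. validation error per epoch on a held-out split.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
t = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=500)

split = int(0.8 * len(X))          # hold out 20% as a validation set
X_tr, t_tr = X[:split], t[:split]
X_va, t_va = X[split:], t[split:]

w = np.zeros(2)
epsilon = 0.1
for epoch in range(20):
    # One epoch of delta-rule training on the training set only.
    w += epsilon * X_tr.T @ (t_tr - X_tr @ w) / len(X_tr)
    train_err = np.mean((t_tr - X_tr @ w) ** 2)
    val_err = np.mean((t_va - X_va @ w) ** 2)   # generalization estimate
    print(f"epoch {epoch}: train {train_err:.4f}  val {val_err:.4f}")
```

When the validation error starts rising while the training error keeps falling, the model has begun to memorize the training set rather than learn from it.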
Preventing Overfitting in Deep Neural Networks
One method of combatting overfitting is called regularization. Regularization modifies the objective function that we minimize by adding additional terms that penalize large weights: the objective becomes Error + λ f(θ), where λ is the regularization strength. In L2 regularization, we add 1/2 λw^2 to the error function for every weight; in L1 regularization, we add the term λ|w|. The value of λ determines how much we want to protect against overfitting: with λ = 0 we take no measures against the possibility of overfitting, while if λ is too large our model will prioritize keeping θ as small as possible over finding the parameter values that perform well on the training set.
(Figure: ANNs trained using minibatch gradient descent (batch size 10) and L1 regularization strengths of 0.01, 0.1, and 1.)

L1 Vs L2 Regularization
L1 regularization leads the weight vectors to become sparse during optimization (close to zero), so neurons with L1 regularization end up using only a small subset of their most important inputs and become quite resistant to noise. In comparison, weight vectors from L2 regularization are usually diffuse, small numbers. L1 regularization is therefore very useful when you want to understand exactly which features are contributing to a decision; if this level of feature analysis is not necessary, we prefer L2 regularization because it empirically performs better.

Max Norm Constraints
Max norm constraints share the goal of restricting θ from becoming too large, but they do so more directly: they enforce an absolute upper bound c on the magnitude of the incoming weight vector for every neuron, and use projected gradient descent to enforce the constraint. Typical values of c are 3 and 4. One nice property is that the parameter vector cannot grow out of control (even if the learning rate is too high), because the updates to the weights are always bounded. (Both the regularization penalties and the max norm projection are sketched in code at the end of this section.)

Dropout
Dropout is implemented by keeping a neuron active only with some probability p. It prevents the network from becoming too dependent on any one neuron (or any small combination of neurons). Mathematically, it prevents overfitting by providing a way of approximately combining exponentially many different neural network architectures efficiently.

Intricacies
Dropout is pretty intuitive to understand, but there are some important intricacies to consider. First, we would like the outputs of neurons during test time to be equivalent to their expected outputs at training time. We could fix this naively by scaling the output at test time: for example, if p = 0.5, neurons must halve their outputs at test time in order to have the same (expected) output they would have during training. This is because if a neuron's output prior to dropout is x, then after dropout the expected output is E[output] = p·x + (1 − p)·0 = p·x.

This naive implementation of dropout is undesirable, however, because it requires scaling neuron outputs at test time, and test-time performance is extremely critical to model evaluation. It is therefore always preferable to use inverted dropout, where the scaling occurs at training time instead of at test time. In inverted dropout, any neuron whose activation hasn't been silenced has its output divided by p before the value is propagated to the next layer. With this fix, E[output] = p·(x/p) + (1 − p)·0 = x, and we can avoid arbitrarily scaling neuronal output at test time.
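As promised above, here is a sketch of the L2/L1 penalty terms and a max norm projection applied to a single weight matrix; lam (the regularization strength) and c (the norm bound) are illustrative values, not prescriptions.

```python
# L2/L1 penalties and a max norm constraint on one weight matrix W, where
# each column of W is the incoming weight vector of one neuron.
import numpy as np

lam, c = 0.1, 3.0
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
grad_error = rng.normal(size=(3, 4))      # stand-in for dError/dW

# Regularized objective: Error + lambda * f(theta).
l2_penalty = 0.5 * lam * np.sum(W ** 2)   # f(theta) = sum of 1/2 * w^2
l1_penalty = lam * np.sum(np.abs(W))      # f(theta) = sum of |w|
print(l2_penalty, l1_penalty)

# The L2 penalty adds lam * w to the gradient (d/dw of 1/2*lam*w^2).
W -= 0.01 * (grad_error + lam * W)

# Max norm: after the gradient step, project each neuron's incoming weight
# vector back onto the ball of radius c (projected gradient descent).
norms = np.linalg.norm(W, axis=0, keepdims=True)   # one norm per neuron
W *= np.minimum(1.0, c / norms)
```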
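And a sketch of inverted dropout itself, assuming a keep probability p of 0.5 and arbitrary activations; note that test time requires no scaling at all.

```python
# Inverted dropout on one layer's activations: scale at training time so
# that test-time outputs need no adjustment.
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                        # probability of keeping a neuron

def inverted_dropout(activations, train=True):
    if not train:
        return activations                     # test time: no scaling needed
    mask = rng.random(activations.shape) < p   # keep each neuron w.p. p
    # Dividing by p at training time keeps the expected output equal to the
    # test-time output: E[output] = p*(x/p) + (1 - p)*0 = x.
    return activations * mask / p

h = rng.random((4, 3))                    # some layer's activations
print(inverted_dropout(h))                # training-time pass
print(inverted_dropout(h, train=False))   # test-time pass, untouched
```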
Thank You