Back Propagation Algorithm PDF

Fundamental Neurocomputing Concepts and Selected Neural Network Architectures and Learning Rules

Care must be taken when one is selecting the learning rate parameter $\mu$ to ensure stability of the feedback during learning. A fixed learning rate parameter can be used, or an adjustable one with respect to time can also be used (cf. Sect. 2.5.1). For each association, the iterative adjustments to the memory matrix in (3.37) continue until the error vector $e_k(r)$ in (3.33) becomes negligibly small. In other words, learning for the $k$th association $(x_k, y_k)$ can be stopped when the actual response $M(r)x_k$ is "close" to the desired response $y_k$, that is, $y_k - M(r)x_k \approx 0$. This results in minimizing the performance criterion in (3.35), and allows the associative memory to reconstruct the memorized patterns in an optimal sense. To initialize the algorithm for learning the memory matrix $M$ from the input/output pairs $(x_k, y_k)$, using the error correction mechanism in (3.37), we set $M(0) = 0$. Comparing these results to the least mean-square (LMS) algorithm in Section 2.5.1, we see that (3.37) is in the form of the LMS algorithm, or delta rule.

3.3 BACKPROPAGATION LEARNING ALGORITHMS

We now consider supervised learning in a feedforward multilayer perceptron (MLP). Specifically, we want to study the backpropagation algorithm, or the generalized delta rule, for training MLPs. Backpropagation is the most widely used learning process in neural networks today, and it was first developed by Werbos in 1974 [17]; however, this work remained unknown for many years [18, 19]. The method has been rediscovered several times: in 1982 by Parker [20] (also see [21, 22]), in 1985 by LeCun [23], and in 1986 by Rumelhart et al. [24, 25]. The presentation of backpropagation by Rumelhart et al. is probably responsible for the popularization of the algorithm in the areas of science and engineering.
Training MLPs with the backpropagation algorithm results in a nonlinear mapping or an association task. Thus, given two sets of data, that is, input/output pairs, the MLP can have its synaptic weights adjusted by the backpropagation algorithm to develop a specific nonlinear mapping (cf. Sect. 2.7). The MLP, with fixed weights after the training process, can provide an association task for classification, pattern recognition, diagnosis, etc. During the training phase of the MLP, the synaptic weights are adjusted to minimize the disparity between the actual and desired outputs of the MLP, averaged over all input patterns (or learning examples).

3.3.1 Basic Backpropagation Algorithm for the Feedforward Multilayer Perceptron

In this section we present a derivation of the standard backpropagation learning algorithm. For the sake of simplicity, we will derive the learning rule for a multilayer perceptron neural network (MLP NN) having three layers of weights, namely, one output layer and two hidden layers, since this is the most frequently used MLP NN architecture. An example of this type of neural network is shown in Figure 3.4. The extension of the derivation to the general case when the network has more than two hidden layers is straightforward.

The standard backpropagation algorithm for training of the MLP NN is based on the steepest descent gradient approach applied to the minimization of an energy function representing the instantaneous error. In other words, we desire to minimize a function defined as

$$E_q = \frac{1}{2}\, e_q^T e_q = \frac{1}{2} \sum_i \left( d_{q,i} - x_{\mathrm{out},q,i}^{(3)} \right)^2 \qquad (3.40)$$

where $d_q$ represents the desired network output for the $q$th input pattern and $x_{\mathrm{out},q}^{(3)} = y_q$ is the actual output of the MLP network shown in Figure 3.4. Very often the method for the weight updates derived from minimizing (3.40) is called the online method, emphasizing the fact that it has minimum memory storage requirements.
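As an illustration, the instantaneous error in (3.40) might be computed as follows (a minimal NumPy sketch; the function and array names are illustrative, not from the text):

```python
import numpy as np

def instantaneous_error(d_q, y_q):
    """E_q = 1/2 * sum_i (d_{q,i} - y_{q,i})^2 for a single input pattern q."""
    e_q = np.asarray(d_q, dtype=float) - np.asarray(y_q, dtype=float)
    return 0.5 * float(e_q @ e_q)

# Desired vs. actual network output for one training pattern
E_q = instantaneous_error([1.0, 0.0], [0.8, 0.1])  # -> 0.025
```

Because the update is derived from this per-pattern (online) error, only the current pattern needs to be held in memory, which is the storage advantage noted above.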
Using the steepest-descent gradient approach, the learning rule for a network weight in any one of the network layers is given by

$$\Delta w_{ji}^{(s)} = -\mu^{(s)} \frac{\partial E_q}{\partial w_{ji}^{(s)}} \qquad (3.41)$$

[Figure 3.4 An example of a three-layer feedforward MLP NN architecture: activity-level vectors and weight matrices for the first hidden layer, the second hidden layer, and the output layer, producing the response (output) vector.]

CHAPTER 3: Mapping Networks

where $s = 1, 2, 3$ designates the appropriate network layer and $\mu^{(s)} > 0$ is the corresponding learning rate parameter. For reasons which will become apparent shortly, we will derive separate learning rules for weights in the output and in the hidden layers of the MLP NN.

Let us first consider the output layer of the network. The weights in the output layer can be updated according to

$$\Delta w_{ji}^{(3)} = -\mu^{(3)} \frac{\partial E_q}{\partial w_{ji}^{(3)}} \qquad (3.42)$$

or

$$\Delta w_{ji}^{(3)} = -\mu^{(3)} \frac{\partial E_q}{\partial v_j^{(3)}} \frac{\partial v_j^{(3)}}{\partial w_{ji}^{(3)}} \qquad (3.43)$$

Separated terms in (3.43) can be evaluated as

$$\frac{\partial v_j^{(3)}}{\partial w_{ji}^{(3)}} = x_{\mathrm{out},i}^{(2)} \qquad (3.44)$$

and

$$\frac{\partial E_q}{\partial v_j^{(3)}} = \frac{\partial}{\partial v_j^{(3)}} \left\{ \frac{1}{2} \sum_i \left[ d_i - f\!\left( v_i^{(3)} \right) \right]^2 \right\} \qquad (3.45)$$

$$= -\left( d_j - x_{\mathrm{out},j}^{(3)} \right) g\!\left( v_j^{(3)} \right) = -\delta_j^{(3)} \qquad (3.46)$$

where $g(\cdot)$ represents the first derivative of the nonlinear activation function $f(\cdot)$. The term $\delta_j^{(3)}$ defined in (3.46) is commonly referred to as the local error, or delta. Combining (3.43), (3.44), and (3.46), we can write the learning rule equation for the weights in the output layer of the network as

$$\Delta w_{ji}^{(3)} = \mu^{(3)} \delta_j^{(3)} x_{\mathrm{out},i}^{(2)} \qquad (3.47)$$

or

$$w_{ji}^{(3)}(k+1) = w_{ji}^{(3)}(k) + \mu^{(3)} \delta_j^{(3)} x_{\mathrm{out},i}^{(2)} \qquad (3.48)$$

The update equations for the weights in the hidden layers of the network can be derived in essentially the same way. Applying the steepest descent gradient approach, we have

$$\Delta w_{ji}^{(2)} = -\mu^{(2)} \frac{\partial E_q}{\partial v_j^{(2)}} \frac{\partial v_j^{(2)}}{\partial w_{ji}^{(2)}} \qquad (3.49)$$

The second partial derivative on the right-hand side in (3.49) can be evaluated as

$$\frac{\partial v_j^{(2)}}{\partial w_{ji}^{(2)}} = x_{\mathrm{out},i}^{(1)} \qquad (3.50)$$

Evaluation of the first partial derivative in (3.49) is more complex, since a change in $v_j^{(2)}$ propagates through the output layer of the network and affects all the network outputs. We can pursue the derivation by expressing this quantity as a function of quantities that are already known and of other terms which are easily evaluated. To proceed, we can write

$$\frac{\partial E_q}{\partial v_j^{(2)}} = \sum_p \frac{\partial E_q}{\partial v_p^{(3)}} \frac{\partial v_p^{(3)}}{\partial v_j^{(2)}} \qquad (3.51)$$

or

$$-\frac{\partial E_q}{\partial v_j^{(2)}} = \left( \sum_p \delta_p^{(3)} w_{pj}^{(3)} \right) g\!\left( v_j^{(2)} \right) = \delta_j^{(2)} \qquad (3.52)$$

Combining equations (3.49), (3.50), and (3.52) yields

$$\Delta w_{ji}^{(2)} = \mu^{(2)} \delta_j^{(2)} x_{\mathrm{out},i}^{(1)} \qquad (3.53)$$

or

$$w_{ji}^{(2)}(k+1) = w_{ji}^{(2)}(k) + \mu^{(2)} \delta_j^{(2)} x_{\mathrm{out},i}^{(1)} \qquad (3.54)$$

Comparing (3.48) and (3.54), we see that the update equations for the weights in the output layer and the hidden layer have the same form. The only difference lies in how we compute the local error. For the output layer, the local error is proportional to the difference between the desired output and the actual network output. By extending the same concept to the "outputs" of the hidden layers, the local error for a neuron in a hidden layer can be viewed as being proportional to the difference between the desired output and the actual output of the particular neuron. Of course, during the training process, the desired outputs of the neurons in the hidden layer are not known, and therefore the local errors need to be recursively estimated in terms of the error signals of all connected neurons. Equation (3.54) can be generalized for the MLP NN having an arbitrary number of hidden layers. For such a network we can write

$$w_{ji}^{(s)}(k+1) = w_{ji}^{(s)}(k) + \mu^{(s)} \delta_j^{(s)} x_{\mathrm{out},i}^{(s-1)} \qquad (3.55)$$

where

$$\delta_j^{(s)} = \left( d_j - x_{\mathrm{out},j}^{(s)} \right) g\!\left( v_j^{(s)} \right) \qquad (3.56)$$

for the output layer, and

$$\delta_j^{(s)} = \left( \sum_p \delta_p^{(s+1)} w_{pj}^{(s+1)} \right) g\!\left( v_j^{(s)} \right) \qquad (3.57)$$

for the hidden layers.

Summary of the standard backpropagation algorithm

Training the MLP NN by using the standard backpropagation algorithm can be performed according to the following algorithm.
Standard backpropagation algorithm

Step 1. Initialize the network synaptic weights to small random values.
Step 2. From the set of training input/output pairs, present an input pattern and calculate the network response.
Step 3. The desired network response is compared with the actual output of the network, and by using (3.56) and (3.57) all the local errors can be computed.
Step 4. The weights of the network are updated according to (3.55).
Step 5. Until the network reaches a predetermined level of accuracy in producing the adequate response for all the training patterns, continue steps 2 through 4.

From the above algorithm we see that classical backpropagation can be interpreted as performing two independent tasks. The first is backpropagation of the errors from the nodes in the output layer to the nodes in the hidden layers, and the second is using the LMS algorithm to update the weights in every layer.

3.3.2 Some Practical Issues in Using Standard Backpropagation

Standard backpropagation and its derivatives are by far the most widely used learning algorithms for training of the MLP NN. In this section we address some of the practical problems involved in its effective application.

Initialization of synaptic weights

The weights of the MLP NN are initially set to small random values. They have to be sufficiently small so that network training does not start from a point in the error space that corresponds to some of the nodes being saturated. When the network operates in saturation, it may take a lot of iterations for the learning to converge. One commonly used heuristic algorithm for weight initialization is to set the weights as uniformly distributed random numbers in the interval from $-0.5/\mathrm{fan\_in}$ to $0.5/\mathrm{fan\_in}$, where fan_in represents the total number of the neurons in the layer that the weights are fed into [26]. For the case of an MLP NN with one hidden layer, an alternate approach was suggested by Nguyen and Widrow [27].
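The fan-in heuristic above might look as follows in code (a sketch; the helper name and the use of NumPy's random generator are my own, and fan_in follows the definition given in the text):

```python
import numpy as np

def fanin_init(shape, fan_in, rng=None):
    """Uniform random weights in [-0.5/fan_in, 0.5/fan_in]."""
    rng = rng or np.random.default_rng()
    limit = 0.5 / fan_in
    return rng.uniform(-limit, limit, size=shape)

# A 4-neuron layer fed by 10 units: all weights stay within +/- 0.05,
# small enough to keep the neurons out of saturation at the start
W = fanin_init((4, 10), fan_in=10, rng=np.random.default_rng(0))
```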
The authors demonstrate that this approach can significantly improve the speed of the network training. Nguyen and Widrow's initialization of MLP NNs can be summarized in the following algorithm.

Nguyen-Widrow initialization algorithm

$n_0$ = number of components in input layer
$n_1$ = number of neurons in hidden layer
$\gamma$ = scaling factor

Step 1. Compute the scaling factor according to

$$\gamma = 0.7\, n_1^{1/n_0} \qquad (3.58)$$

Step 2. Initialize the weights $w_{ji}$ of a layer as random numbers between $-0.5$ and $0.5$.

Step 3. Reinitialize the weights according to

$$w_{ji} = \frac{\gamma\, w_{ji}}{\left\| w_j \right\|} \qquad (3.59)$$

Network configuration and ability of the network to generalize

The configuration of the MLP NN is determined by the number of hidden layers, the number of neurons in each of the hidden layers, as well as the type of the activation functions used for the neurons. While it has been proved that the performance of the network does not depend much on the type of the activation function (as long as it is nonlinear), the choice of the number of hidden layers and the number of units in each of the hidden layers is critical. Hornik et al. [28] established that an MLP NN that has only one hidden layer, with a sufficient number of neurons, acts as a universal approximator of nonlinear mappings. In practice, it is very difficult to determine a sufficient number of neurons necessary to achieve the desired degree of approximation accuracy. Frequently, the number of units in the hidden layer is determined by trial and error. Furthermore, if the network has only one hidden layer, the neurons seem to "interact" with one another [29]. In such a situation, it is difficult to improve the approximation for one point in the mapping without degrading it at some other point. For the above reason, MLP NNs are commonly designed with two hidden layers.
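The three steps of the Nguyen-Widrow procedure described earlier in this section can be sketched as follows (function and variable names are mine; $n_0$ and $n_1$ are as defined in the algorithm):

```python
import numpy as np

def nguyen_widrow_init(n0, n1, rng=None):
    """Nguyen-Widrow initialization of hidden-layer weights.
    n0 = number of input components, n1 = number of hidden neurons."""
    rng = rng or np.random.default_rng()
    gamma = 0.7 * n1 ** (1.0 / n0)            # step 1, (3.58)
    W = rng.uniform(-0.5, 0.5, (n1, n0))      # step 2: random in (-0.5, 0.5)
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return gamma * W / norms                  # step 3, (3.59): rescale rows

W = nguyen_widrow_init(n0=2, n1=8)
# Each neuron's weight vector now has norm gamma = 0.7 * 8**(1/2)
```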
Typically, to solve a real-world problem using MLP NNs, we need to train a relatively large neural network architecture. Having a large number of units in the hidden layers guarantees good network performance when the network is presented with input patterns that belong to the training set. However, an "over-designed" architecture will tend to "overfit" the training data [30-32], which results in the loss of the generalization property of the network. To clarify this, consider the following example.

EXAMPLE 3.2. An MLP NN is to be trained to approximate the nonlinear function

$$y = e^{-x} \sin(\pi x) \qquad (3.60)$$

in the interval $[0, 4]$. This is a fairly simple problem for a neural network with one hidden layer consisting of 50 neurons. For this example, the interval $[0, 4]$ was sampled with 21 points separated by 0.2. The MATLAB routine trainlm was used to perform the network training with hyperbolic tangent nonlinearities and a target mean square error of 0.01 over the entire data set. The network converged in only 5 epochs, and Figure 3.5(a) illustrates the agreement between the desired and actual network outputs for the training data set. To test the network's generalization capability, the same interval was sampled with 401 points separated by 0.01, and the network response is presented in Figure 3.5(b). As seen in Figure 3.5(b), the response of the network does not show good agreement with the function we are trying to approximate.

[Figure 3.5 Illustration of data overfitting. (a) Response of the network to the training inputs. (b) Response of the network to the test data set.]
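The two data sets in Example 3.2 can be generated as follows (a sketch; the target is assumed here to be the decaying sinusoid $e^{-x}\sin(\pi x)$, and the grids follow the sampling stated in the example):

```python
import numpy as np

def target(x):
    # assumed form of the target function in (3.60): a decaying sinusoid
    return np.exp(-x) * np.sin(np.pi * x)

# Training set: 21 points on [0, 4] separated by 0.2
x_train = np.linspace(0.0, 4.0, 21)
y_train = target(x_train)

# Test set for generalization: 401 points separated by 0.01
x_test = np.linspace(0.0, 4.0, 401)
y_test = target(x_test)
```

With 50 hidden neurons and only 21 training samples, the network has far more free parameters than data points, which is what invites the overfitting seen in Figure 3.5(b).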
The reason for this is overfitting of the training data. In this case, a network with a considerably smaller number of neurons would perform the approximation task in a better way.

Independent validation

It is never a good idea to assess the generalization properties of a neural network based on the training data alone. Using the training data to assess the final performance quality of the network can lead to overfitting. This can be avoided by using a standard method in statistics called independent validation (cf. Sect. 9.4). The method involves dividing the available data into a training set and a test set. The entire data set is usually randomized first. The training data are next split into two partitions; the first partition is used to update the weights in the network, and the second partition is used to assess (or validate) the training performance (i.e., it is used to decide when to stop training). The test data are then used to assess how well the network has generalized.

Speed of convergence

As seen from the derivation of the standard backpropagation algorithm, it is a generalization of the LMS algorithm presented in Section 2.5.1. Conversely, the LMS algorithm for training a single-layer perceptron can be seen as a special case of the standard backpropagation algorithm. In Section 2.5.1 it was demonstrated that the convergence properties of the LMS algorithm (in particular its speed and stability) depend critically on the magnitude of the learning rate parameter. To guarantee network convergence and avoid oscillations during the training, the learning rate parameter must be set to a relatively small value. However, a small learning rate parameter restricts the change in the weights of the network. Furthermore, if the starting point of the network training is far from the global minimum, some of the neurons can be operating in saturation. When that happens, the derivative of the activation function is small.
The magnitude of the weight changes depends directly on the magnitude of the activation function derivative, so the network can get stuck on a flat plateau of the error surface, and it might require many iterations before convergence is achieved. It is not uncommon for even moderately complex real-world problems to require hours, even days, of network training.

The slow convergence of the backpropagation algorithm has encouraged research in alternate (faster) algorithms for MLP NN training. Research on faster algorithms can be roughly divided into two main categories. The first category consists of various heuristic improvements to the standard backpropagation algorithm. Although useful, and in many cases easily understandable, the heuristic algorithms are very much case-by-case specific, and their performance characteristics cannot be easily established. The second category involves the use of standard numerical optimization techniques. Most of the algorithms in this category give significant improvement in the network convergence speed at the expense of increased network computational complexity. Some representative and popular algorithms from both categories are presented in the following sections.

In addition to modification of backpropagation learning, preprocessing and reduction in the input data can result in improved performance and faster learning. That is, reduction in the network size reduces its complexity, and this significantly improves the convergence speed. Some of the methods for data preprocessing are covered in Section 2.9.

3.3.3 Backpropagation Learning Algorithm with Momentum Updating

Backpropagation with momentum updating is one of the most popular modifications to the standard algorithm presented in Section 3.3.
The idea of the algorithm is to update the weights in the direction which is a linear combination of the current gradient of the instantaneous error surface and the one obtained in the previous step of the training. Namely, the weights are updated according to

$$\Delta w_{ji}^{(s)}(k+1) = \mu^{(s)} \delta_j^{(s)}(k)\, x_{\mathrm{out},i}^{(s-1)}(k) + \alpha\, \Delta w_{ji}^{(s)}(k) \qquad (3.61)$$

or

$$w_{ji}^{(s)}(k+1) = w_{ji}^{(s)}(k) + \mu^{(s)} \left[ \delta_j^{(s)}(k)\, x_{\mathrm{out},i}^{(s-1)}(k) + \alpha\, \delta_j^{(s)}(k-1)\, x_{\mathrm{out},i}^{(s-1)}(k-1) \right] \qquad (3.62)$$

where $\alpha$ is commonly referred to as the forgetting factor and is typically chosen in the interval (0, 1). The second term in (3.61) is called the momentum term, and it adds to the current update a fraction of the most recent weight change. This clearly affects the speed of the algorithm: it improves the convergence speed of the standard backpropagation algorithm by introducing stabilization in the weight changes. Intuitively, according to (3.61), if the weights are to be changed in the same direction as in the previous step, the rate of change is increased. Alternatively, if the change in the current step is not in the same direction as that in the previous step, the rate of change is decreased. This type of learning significantly improves convergence in some very important cases which are handled poorly by the standard backpropagation algorithm. First, if the training patterns contain some element of uncertainty, for example, noise, then updating with momentum provides a sort of low-pass filtering by preventing rapid changes in the direction of the weight updates. Second, this kind of behavior renders the training relatively immune to the presence of outliers or erroneous training pairs. Also, if the network is operating on a flat plateau of the error surface, the presence of momentum will increase the rate of weight change, and the speed of convergence is increased. This can be conveniently illustrated by considering the weight update equation
$$\Delta w_{ji}^{(s)}(k+1) = -\mu^{(s)} \frac{\partial E_q}{\partial w_{ji}^{(s)}(k)} + \alpha\, \Delta w_{ji}^{(s)}(k) \qquad (3.63)$$

If the network is operating on a flat area of the error surface, the value of the gradient is not changing substantially from step to step; therefore (3.63) can be approximated as

$$\Delta w_{ji}^{(s)}(k+1) \approx -\mu^{(s)} \frac{\partial E_q}{\partial w_{ji}^{(s)}} \left( 1 + \alpha + \alpha^2 + \cdots \right) = -\frac{\mu^{(s)}}{1-\alpha} \frac{\partial E_q}{\partial w_{ji}^{(s)}} \qquad (3.64)$$

Because the forgetting factor $\alpha$ is always smaller than unity, updating with momentum increases the effective learning rate to

$$\mu_{\mathrm{eff}} = \frac{\mu^{(s)}}{1-\alpha} \qquad (3.65)$$

3.3.4 Batch Updating

The standard backpropagation algorithm assumes that the weights are updated for every input/output training pair. A batch-updating approach accumulates the weight corrections over several training patterns (possibly one entire epoch) before actually performing the update. The update is commonly formed as an average of the corrections for each individual input/output pair.
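Batch updating might be sketched as follows (illustrative names; each per-pattern correction has the delta-rule form mu * delta * x_out of (3.55), and the corrections for the batch are averaged before a single update is applied):

```python
import numpy as np

def batch_update(W, X_out_prev, deltas, mu=0.1):
    """Accumulate the corrections mu * delta * x_out over a batch of
    patterns, then apply their average as one weight update."""
    dW = np.zeros_like(W)
    for x, delta in zip(X_out_prev, deltas):
        dW += mu * np.outer(delta, x)   # correction for one input/output pair
    return W + dW / len(X_out_prev)     # single update with the average

# Two patterns, each contributing a correction to a 1 x 2 weight matrix
W = np.zeros((1, 2))
X_prev = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
deltas = [np.array([1.0]), np.array([1.0])]
W_new = batch_update(W, X_prev, deltas)  # -> [[0.05, 0.05]]
```

Averaging over the batch smooths the individual corrections, at the cost of the extra storage the online method avoids.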
