Batch Normalization
INTRODUCTION
This second-order term might be negligible if the product of the weights on layers 3 through l is small, or might be exponentially large if the weights on layers 3 through l are all greater than 1.
This makes it very hard to choose an appropriate learning rate, because the effects of an update to the
parameters for one layer depend so strongly on all of the other layers.
Second-order optimization algorithms address this issue by computing an update that takes these second-order
interactions into account, but we can see that in very deep networks, even higher-order interactions can be
significant.
Even second-order optimization algorithms are expensive and usually require numerous approximations that
prevent them from truly accounting for all significant second-order interactions.
Building an n-th order optimization algorithm for n > 2 thus seems hopeless.
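To make this interaction concrete, here is a small numerical sketch (not taken from the text) using the same kind of single-unit linear chain, ŷ = x w_1 w_2 ⋯ w_l, that the linear example below discusses. Treating ŷ itself as the quantity being minimized, and the particular layer count and learning rate, are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
l = 10                        # number of single-unit linear layers
x = 1.0                       # scalar input
w = rng.normal(1.0, 0.2, l)   # weights near 1

def y_hat(w):
    return x * np.prod(w)     # y_hat = x * w_1 * w_2 * ... * w_l

y = y_hat(w)
g = np.array([y / w_i for w_i in w])   # gradient: x times the product of the other weights

lr = 0.01
w_new = w - lr * g                     # simultaneous gradient step on every layer

first_order = y - lr * g @ g           # what a purely first-order analysis predicts
print("first-order prediction:", first_order)
print("actual new y_hat:      ", y_hat(w_new))   # differs because of the higher-order terms
```

The gap between the two printed values is exactly the contribution of the second- and higher-order terms discussed above.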
Batch normalization provides an elegant way of
reparametrizing almost any deep network.
This means that the gradient will never propose an operation that acts simply to increase the standard deviation or mean of h_i;
the normalization operations remove the effect of such an action and zero out its component in the gradient.
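As a rough sketch of the normalization operation itself (the per-unit layout of H and the small stability constant eps are assumptions, not details given here):

```python
import numpy as np

def batch_normalize(H, eps=1e-8):
    """Replace H with its standardized version H' = (H - mu) / sigma.

    H holds a minibatch of activations, one example per row and one unit
    per column; mu and sigma are computed per unit over the minibatch.
    """
    mu = H.mean(axis=0, keepdims=True)
    sigma = np.sqrt(H.var(axis=0, keepdims=True) + eps)  # eps avoids division by zero
    return (H - mu) / sigma
```

Because μ and σ are computed from H itself, back-propagating through this operation cancels any gradient component that would act only to shift the mean or rescale the standard deviation.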
This was a major innovation of the batch normalization approach. Previous approaches had involved adding penalties to the
cost function to encourage units to have normalized activation statistics or involved intervening to renormalize unit statistics
after each gradient descent step.
The former approach usually resulted in imperfect normalization and the latter usually resulted in significant wasted time as the
learning algorithm repeatedly proposed changing the mean and variance and the normalization step repeatedly undid this
change.
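A deliberately simplified caricature of that tug-of-war (the single-unit model, the toy loss that rewards a large mean, and the learning rate are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=256)   # inputs to a single unit
b = 0.0                    # bias the optimizer keeps trying to grow
lr = 0.1

for step in range(5):
    grad_b = -1.0                    # gradient of the assumed loss -mean(h): always asks for a larger mean
    b -= lr * grad_b                 # the gradient step pushes the mean of h upward...
    h = (x + b) - (x + b).mean()     # ...and the post-hoc renormalization immediately undoes it
    print(f"step {step}: b = {b:.1f}, mean(h) = {h.mean():.2e}")
```

The bias keeps growing while the renormalized mean stays at zero, so the effort spent on the mean at every step is thrown away.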
Batch normalization reparametrizes the model to make some units always be standardized by definition, deftly sidestepping
both problems.
At test time, μ and σ may be replaced by running averages that were collected during training.
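A sketch of how such running averages might be maintained; the exponential moving average and the momentum value are implementation assumptions rather than something specified here.

```python
import numpy as np

class RunningNormalizer:
    """Normalize with minibatch statistics during training and with
    running averages of those statistics at test time."""

    def __init__(self, num_units, momentum=0.9, eps=1e-8):
        self.momentum = momentum
        self.eps = eps
        self.mu = np.zeros(num_units)
        self.var = np.ones(num_units)

    def __call__(self, H, training=True):
        if training:
            mu, var = H.mean(axis=0), H.var(axis=0)
            # Accumulate the running averages that replace the minibatch
            # statistics at test time.
            self.mu = self.momentum * self.mu + (1 - self.momentum) * mu
            self.var = self.momentum * self.var + (1 - self.momentum) * var
        else:
            mu, var = self.mu, self.var
        return (H - mu) / np.sqrt(var + self.eps)
```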
LINEAR EXAMPLE
Consider, as a concrete example, a deep network of single-unit linear layers, ŷ = x w_1 w_2 ⋯ w_l, with the input x drawn from a unit Gaussian, and let h^{l-1} denote the output of layer l−1. Because the map from x to h^{l-1} is linear, h^{l-1} is also Gaussian; however, it will no longer have zero mean and unit variance. Batch normalization replaces it with the standardized ĥ^{l-1}, and for almost any update to the lower layers, ĥ^{l-1} will remain a unit Gaussian.
The output ŷ may then be learned as a simple linear function ŷ = w_l ĥ^{l-1}. Learning in
this model is now very simple because the parameters at the lower layers simply do not
have an effect in most cases; their output is always renormalized to a unit Gaussian.
Changing one of the lower-layer weights to 0 can make the output become degenerate,
and changing the sign of one of the lower weights can flip the relationship between ĥ^{l-1}
and y.
These situations are very rare. Without normalization, nearly every update would have an
extreme effect on the statistics of h^{l-1}.
Batch normalization has thus made this model significantly easier to learn.
In this example, the ease of learning of course came at the cost of making the lower layers
useless.
The lower layers no longer have any harmful effect, but they also
no longer have any beneficial effect.
This is because we have normalized out the first- and second-order
statistics, which is all that a linear network can influence.
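These claims can be checked numerically. The sketch below is an assumption-laden illustration: it draws a minibatch x from a unit Gaussian, forms h^{l-1} for the single-unit linear chain, and normalizes it over the minibatch.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=256)                 # minibatch drawn from a unit Gaussian
w_lower = rng.normal(1.0, 0.5, size=9)   # weights of layers 1 .. l-1

def h_hat(w_lower):
    h = x * np.prod(w_lower)             # h^{l-1} of the single-unit linear chain
    return (h - h.mean()) / h.std()      # normalize over the minibatch

reference = h_hat(w_lower)

# Rescaling a lower-layer weight leaves the normalized output unchanged.
w_scaled = w_lower.copy()
w_scaled[0] *= 10.0
print(np.allclose(reference, h_hat(w_scaled)))    # True

# Flipping the sign of a lower-layer weight only flips the relationship.
w_flipped = w_lower.copy()
w_flipped[0] *= -1.0
print(np.allclose(reference, -h_hat(w_flipped)))  # True
```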
In a deep neural network with nonlinear activation functions, the
lower layers can perform nonlinear transformations of the data, so
they remain useful.
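A small sketch of this contrast (the ReLU nonlinearity, the shifted pre-activation, and skewness as the shape statistic are all illustrative choices): after normalization the mean and variance are identical for every lower-layer weight w, but the shape of the activation distribution still depends on w, so the lower layer retains an effect.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)

def normalize(h):
    return (h - h.mean()) / h.std()

def skewness(h):
    z = normalize(h)
    return (z ** 3).mean()   # shape statistic that normalization does not remove

for w in (0.5, 1.0, 2.0):
    h = np.maximum(0.0, w * x + 1.0)     # ReLU(w * x + 1)
    print(f"w = {w}: skewness after normalization = {skewness(h):.2f}")
```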
Although normalizing the activations might seem to limit what the network can express, the new parametrization can represent the same family of functions of the input as the old parametrization; what differs are the learning dynamics. In the old parametrization, the mean of H was determined by a complicated interaction between the parameters in the layers below H.