Batch Normalization
Batch normalization is a technique for improving the speed, performance, and stability of artificial neural networks. It was introduced in a 2015 paper.[1][2] It is used to normalize the layers' inputs by re-centering and re-scaling them.[3]
While the effect of batch normalization is evident, the reasons behind its effectiveness remain under discussion. It was believed to mitigate the problem of internal covariate shift, where parameter initialization and changes in the distribution of the inputs of each layer affect the learning rate of the network.[1] More recently, some scholars have argued that batch normalization does not reduce internal covariate shift, but rather smooths the objective function, which in turn improves performance.[4] Others maintain that batch normalization achieves length-direction decoupling, and thereby accelerates neural networks.[5]
Each layer of a neural network has inputs with a corresponding distribution, which is affected during
the training process by the randomness in the parameter initialization and the randomness in the
input data. The effect of these sources of randomness on the distribution of the inputs to internal
layers during training is described as internal covariate shift. Although a clear-cut, precise definition seems to be missing, the phenomenon observed in experiments is the change in the means and variances of the inputs to internal layers during training.
Batch normalization was initially proposed to mitigate internal covariate shift.[1] During the training stage of a network, as the parameters of the preceding layers change, the distribution of inputs to the current layer changes accordingly, so that the current layer needs to constantly readjust to new distributions. This problem is especially severe for deep networks, because small changes in shallower hidden layers are amplified as they propagate through the network, resulting in significant shifts in deeper hidden layers. Batch normalization was therefore proposed to reduce these unwanted shifts, speeding up training and producing more reliable models.
Besides reducing internal covariate shift, batch normalization is believed to introduce many other benefits. With this additional operation, the network can use higher learning rates without vanishing or exploding gradients. Furthermore, batch normalization seems to have a regularizing effect, improving the network's generalization properties, so that it becomes unnecessary to use dropout to mitigate overfitting. It has also been observed that with batch normalization the network becomes more robust to different initialization schemes and learning rates.
Procedures
In a neural network, batch normalization is achieved through a normalization step that fixes the means and variances of each layer's inputs. Ideally, the normalization would be conducted over the entire training set, but using these global statistics is impractical when this step is combined with stochastic optimization methods. Thus, normalization is restricted to each mini-batch in the training process.
Use $B$ to denote a mini-batch of size $m$ of the entire training set. The empirical mean and variance of $B$ can thus be denoted as

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i \quad \text{and} \quad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x_i - \mu_B \right)^2.$$
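As a concrete illustration, these per-dimension statistics can be computed directly with NumPy. This is a minimal sketch; the function name batch_stats and the layout of x (one example per row) are illustrative assumptions, not notation from the original paper.

```python
import numpy as np

def batch_stats(x):
    """Empirical per-dimension statistics of a mini-batch B.

    x: array of shape (m, d), one row per example in the mini-batch.
       (Illustrative layout; not prescribed by the paper.)
    Returns mu_B and sigma_B^2, each of shape (d,).
    """
    mu = x.mean(axis=0)   # mu_B: per-dimension mean over the m examples
    var = x.var(axis=0)   # sigma_B^2: biased variance (divides by m), matching the formula above
    return mu, var
```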
For a layer of the network with $d$-dimensional input $x = (x^{(1)}, \ldots, x^{(d)})$, each dimension of its input is then normalized (i.e. re-centered and re-scaled) separately,

$$\hat{x}_i^{(k)} = \frac{x_i^{(k)} - \mu_B^{(k)}}{\sqrt{\left( \sigma_B^{(k)} \right)^2 + \epsilon}}, \quad \text{where } k \in [1, d] \text{ and } i \in [1, m].$$
Here $\epsilon$ is an arbitrarily small constant added in the denominator for numerical stability. The resulting normalized activations $\hat{x}^{(k)}$ have zero mean and unit variance, if $\epsilon$ is not taken into account. To restore the representation power of the network, a transformation step then follows as

$$y_i^{(k)} = \gamma^{(k)} \hat{x}_i^{(k)} + \beta^{(k)},$$

where the parameters $\gamma^{(k)}$ and $\beta^{(k)}$ are subsequently learned in the optimization process.
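Putting the pieces together, a training-time forward pass of batch normalization might look like the following NumPy sketch. The function name batchnorm_forward, the choice of eps, and the initialization of gamma and beta are illustrative assumptions rather than a definitive implementation.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization of a mini-batch (training mode).

    x:     inputs of shape (m, d), one row per example.
    gamma: learned per-dimension scale, shape (d,).
    beta:  learned per-dimension shift, shape (d,).
    """
    mu = x.mean(axis=0)                     # mu_B
    var = x.var(axis=0)                     # sigma_B^2
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize: zero mean, unit variance
    return gamma * x_hat + beta             # scale and shift to restore representation power

# Usage: a mini-batch of m = 32 examples with d = 4 features,
# with gamma initialized to ones and beta to zeros (a common, assumed choice).
x = np.random.randn(32, 4)
y = batchnorm_forward(x, np.ones(4), np.zeros(4))
```

Note that this sketch covers only the training-time computation; at inference time, implementations typically replace the mini-batch statistics with running averages accumulated during training.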