Batch Normalization

Motivation
Batch normalization was originally developed to address the problem of “internal covariate shift”.
The randomness of the initial weight values and of the batch selection can create situations that
are unfavorable for training. This may become worse in a deep model, because small changes in the
shallower hidden layers are amplified as they propagate through the network, resulting in a
significant shift in the deeper hidden layers.
It was found experimentally that batch normalization accelerates learning, but the current view is
that this acceleration is not due to a reduction of the internal covariate shift; rather, batch
normalization “smooths” the function being optimized.

The idea
The idea is to “normalize” the input to a layer by re-centering and re-scaling. Let B be a batch of
training data, containing the m examples x1 . . . xm . (xi is taken as a scalar. For multidimensional
data, batch normalization is applied separately in each dimension.) Then batch normalization
replaces each xi with yi defined as follows:

yi = γx̂i + β,

where:

µB = (1/m) Σi xi ,      vB = (1/m) Σi (xi − µB)² ,      x̂i = (xi − µB) / √(vB + ε)

The parameters γ, β are learned during the backpropagation optimization. ε is a small positive
value that guards against numerical instability that may occur if vB is too small.
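
As an illustration, here is a minimal NumPy sketch of this transform for one batch of scalar
inputs. The function name batch_norm, the sample batch, and the choice ε = 1e-5 are only for this
example; in a real network γ and β would be trainable parameters updated by backpropagation.

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x is a batch of m scalar inputs, shape (m,)
    mu_b = x.mean()                           # batch mean µB
    v_b = x.var()                             # batch variance vB
    x_hat = (x - mu_b) / np.sqrt(v_b + eps)   # re-centered, re-scaled x̂i
    return gamma * x_hat + beta               # yi = γ x̂i + β

# Example batch of m = 4 scalars; with γ = 1, β = 0 the output has
# (approximately) zero mean and unit variance.
x = np.array([1.0, 2.0, 4.0, 5.0])
print(batch_norm(x, gamma=1.0, beta=0.0))

For multidimensional inputs the same computation would be applied separately in each dimension,
e.g. by taking the mean and variance along the batch axis (axis=0).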
