Difference Between Local Response Normalization and Batch Normalization - by Aqeel Anwar - Towards Data Science
Why Normalization?
Normalization has become important for deep neural networks to compensate for the unbounded nature of certain activation functions such as ReLU and ELU. With these activation functions, the output activations are not constrained to a bounded range (such as [-1,1] for tanh); rather, they can grow as large as the training allows. To keep unbounded activations from blowing up the layer outputs, normalization is used just before the activation function. Two normalization techniques are commonly used in deep neural networks and are often misunderstood by beginners. In this tutorial, both techniques are explained in detail, highlighting their key differences.
Local Response Normalization (LRN):
Inter-Channel LRN: This is what the AlexNet paper originally used. The neighborhood is defined across the channels: for each (x,y) position, the normalization is carried out along the depth dimension and is given by the following formula.
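In LaTeX, the inter-channel LRN formula from the AlexNet paper (using the variables defined below) can be written as:

```latex
b^{i}_{x,y} = \frac{a^{i}_{x,y}}{\Big(k + \alpha \sum_{j=\max(0,\; i-n/2)}^{\min(N-1,\; i+n/2)} \big(a^{j}_{x,y}\big)^{2}\Big)^{\beta}}
```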
where i indexes the filter, a(x,y) and b(x,y) are the pixel values at the (x,y) position before and after normalization respectively, and N is the total number of channels. The constants (k, α, β, n) are hyper-parameters: k is used to avoid singularities (division by zero), α is a normalization constant, and β is a contrast constant. The constant n defines the neighborhood length, i.e. how many consecutive pixel values are considered while carrying out the normalization. The case (k, α, β, n) = (0, 1, 1, N) is standard normalization. In the figure above, n is taken to be 2 while N = 4.
Let’s have a look at an example of Inter-Channel LRN. Consider the following figure.
Different colors denote different channels, hence N = 4. Let’s take the hyper-parameters to be (k, α, β, n) = (0, 1, 1, 2). The value n = 2 means that while calculating the normalized value at position (i, x, y), we consider the values at the same position in the previous and the next filter, i.e. (i-1, x, y) and (i+1, x, y). For (i, x, y) = (0, 0, 0) we have value(i,x,y) = 1, value(i-1,x,y) doesn’t exist, and value(i+1,x,y) = 1. Hence normalized_value(i,x,y) = 1/(1² + 1²) = 0.5, which can be seen in the lower part of the figure above. The rest of the normalized values are calculated in a similar way.
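As a quick sanity check on the arithmetic above, here is a minimal NumPy sketch of Inter-Channel LRN. The function name and the all-ones toy input are illustrative (the figure's exact values are not reproduced here), but with (k, α, β, n) = (0, 1, 1, 2) it returns 0.5 at position (0, 0, 0), matching the worked example. In practice k > 0, which keeps the denominator from ever being zero.

```python
import numpy as np

def inter_channel_lrn(a, k=0.0, alpha=1.0, beta=1.0, n=2):
    """Inter-channel LRN on an activation of shape (N, H, W).

    For each position (i, x, y), the squared activations of the n/2
    channels on either side of channel i (plus channel i itself) are
    summed in the denominator, following the AlexNet-style formula.
    """
    N, H, W = a.shape
    b = np.zeros_like(a, dtype=float)
    half = n // 2
    for i in range(N):
        lo, hi = max(0, i - half), min(N - 1, i + half)
        # Sum of squares over the channel neighborhood, at every (x, y).
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

# Toy check: all-ones input with N = 4 channels of size 8x8.
# At (0, 0, 0) only channels 0 and 1 fall in the neighborhood,
# so the result is 1 / (1^2 + 1^2) = 0.5, as in the example above.
a = np.ones((4, 8, 8))
print(inter_channel_lrn(a, k=0.0, alpha=1.0, beta=1.0, n=2)[0, 0, 0])  # 0.5
```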
Intra-Channel LRN: The only difference between Inter- and Intra-Channel LRN is the neighborhood used for the normalization. In Intra-Channel LRN, a 2D neighborhood is defined around the pixel under consideration within the same channel (as opposed to the 1D neighborhood across channels in Inter-Channel LRN), and the sum in the denominator runs over this spatial neighborhood (see the formula sketched below), where (W,H) are the width and height of the feature map (for example, in the figure above (W,H) = (8,8)). As an example, the figure below shows Intra-Channel normalization on a 5x5 feature map with n = 2 (i.e. a 2D neighborhood of size (n+1)x(n+1) centered at (x,y)).
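One common way to write the Intra-Channel formula (using c for the channel index to avoid clashing with the constant k) is:

```latex
b^{c}_{x,y} = \frac{a^{c}_{x,y}}{\Big(k + \alpha \sum_{p=\max(0,\; x-n/2)}^{\min(W-1,\; x+n/2)} \;\sum_{q=\max(0,\; y-n/2)}^{\min(H-1,\; y+n/2)} \big(a^{c}_{p,q}\big)^{2}\Big)^{\beta}}
```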
Batch Normalization:
Batch Normalization (BN) is a trainable layer normally used to address the issue of Internal Covariate Shift (ICS) [1]. ICS arises due to the changing distribution of the hidden neurons/activations. Consider the following example of binary classification where we need to classify roses vs. no-roses.
Roses vs No-roses classification. The feature maps plotted on the right have different distributions for two different batches sampled from the dataset [1]
Say we have trained a neural network, and now we select two significantly different-looking batches from the dataset for inference (as shown above). If we do a forward pass with these two batches and plot the feature space of a hidden layer (deep in the network), we will see a significant shift in the distribution, as seen on the right-hand side of the figure above. This is called the covariate shift of the input neurons. What impact does this have during training? During training, if we select batches that belong to different distributions, training slows down: for a given batch the network tries to learn a certain distribution, which is different for the next batch, so it keeps bouncing back and forth between distributions until it converges. This covariate shift can be mitigated by making sure that the members within a batch do not all come from the same (or a very similar) distribution, which can be done by randomly selecting images for each batch. A similar covariate shift exists for the hidden neurons. Even if the batches are randomly selected, the hidden neurons can end up with a particular distribution that slows down the training. This covariate shift for hidden layers is called Internal Covariate Shift. The problem is that we can’t directly control the distribution of the hidden neurons, as we did for the input neurons, because it keeps changing as training updates the network parameters. Batch Normalization helps mitigate this issue.
In essence, BN performs the following steps on the activations of a layer:
1. Normalize the mini-batch by subtracting the mean and dividing by the standard deviation.
2. Scale and shift the normalized values using the trainable parameters Gamma and Beta.
3. Feed this scaled and shifted normalized mini-batch to the activation function.
The normalization is carried out for each pixel position across all the activations in a batch. Consider the figure below. Assume we have a mini-batch of size 3, and a hidden layer produces activations of size (C,H,W) = (4,4,4). Since the batch size is 3, we have 3 such activations. For each pixel position in the activation (i.e. for each of the 4x4x4 = 64 positions), we normalize by finding the mean and variance of that pixel position across all the activations in the batch, as shown in the left part of the figure below. Once the mean and variance are found, we subtract the mean from each of the activations and divide by the standard deviation. The right part of the figure below depicts this. The subtraction and division are carried out point-wise (if you are used to MATLAB, the division is dot-division, ./ ).
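To make this concrete, here is a minimal NumPy sketch of the forward pass as described above (mean and variance per pixel position across the batch), including the Gamma/Beta scale-and-shift of step 2. The function name, the small epsilon term, and the random toy input are illustrative choices rather than the exact implementation of any framework's BatchNorm layer.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Sketch of the BN forward pass described above.

    x has shape (batch, C, H, W). The mean and variance are computed
    for each pixel position across the batch dimension (axis 0), then
    the normalized values are scaled by gamma and shifted by beta.
    """
    mu = x.mean(axis=0)                    # per-position mean, shape (C, H, W)
    var = x.var(axis=0)                    # per-position variance, shape (C, H, W)
    x_hat = (x - mu) / np.sqrt(var + eps)  # step 1: normalize
    return gamma * x_hat + beta            # step 2: scale and shift

# Mini-batch of size 3 with activations of size (C, H, W) = (4, 4, 4).
x = np.random.randn(3, 4, 4, 4)
gamma, beta = np.ones((4, 4, 4)), np.zeros((4, 4, 4))
y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0)[0, 0, 0], y.std(axis=0)[0, 0, 0])  # ~0 and ~1 per position
```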
The reason for step 2, i.e. scaling and shifting, is to let the training decide whether we even need the normalization or not. There are some cases in which not having normalization yields better results. So instead of deciding beforehand whether to include a normalization layer, BN lets the training decide. When Gamma = σ_B and Beta = μ_B, no normalization is carried out and the original activations are restored. A really good video tutorial on BN by Andrew Ng can be found here.
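Written out (with a small ε added for numerical stability, as in the BN paper [2]), the two steps are:

```latex
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^{2} + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta
```

Setting γ = √(σ_B² + ε) and β = μ_B gives y_i = x_i, which is the sense in which the original activations are restored.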
Comparison:
LRN has multiple directions along which the normalization can be performed (Inter- or Intra-Channel); BN, on the other hand, is carried out in only one way (for each pixel position across all the activations in a batch). The table below compares the two normalization techniques.
References:
[1] https://www.learnopencv.com/batch-normalization-in-deep-networks/
[2] Ioffe, Sergey, and Christian Szegedy. “Batch normalization: Accelerating deep
network training by reducing internal covariate shift.” arXiv preprint
arXiv:1502.03167 (2015).
Bonus:
Compact cheat sheets for this topic and many other important topics in Machine Learning can be found in the link below.
If this article was helpful to you, feel free to clap, share, and respond to it. If you want to learn more about Machine Learning and Data Science, follow me @Aqeel Anwar or connect with me on LinkedIn.