
Under review as a conference paper at ICLR 2019

NORMALIZATION GRADIENTS ARE LEAST-SQUARES RESIDUALS

Anonymous authors
Paper under double-blind review

ABSTRACT

Batch Normalization (BN) and its variants have seen widespread adoption in
the deep learning community because they improve the training of deep neural
networks. Discussions of why this normalization works so well remain unset-
tled. We make explicit the relationship between ordinary least squares and partial
derivatives computed when back-propagating through BN. We recast the back-
propagation of BN as a least squares fit, which zero-centers and decorrelates par-
tial derivatives from normalized activations. This view, which we term gradient-
least-squares, is an extensible and arithmetically accurate description of BN. To
further explore this perspective, we motivate, interpret, and evaluate two adjust-
ments to BN.

1 INTRODUCTION

Training deep neural networks has become central to many machine learning tasks in computer vi-
sion, speech, and many other application areas. Ioffe & Szegedy (2015) showed empirically that
Batch Normalization (BN) enables deep networks to attain faster convergence and lower loss. Rea-
sons for the effectiveness of BN remain an open question (Lipton & Steinhardt, 2018). Existing
work towards explaining this has focused on covariate shift; Santurkar et al. (2018) described how
BN makes the loss function smoother. This work examines the details of the back-propagation of
BN, and recasts it as a least squares fit. This gradient regression zero-centers and decorrelates partial
derivatives from the normalized activations; it passes on a scaled residual during back-propagation.
Our view provides novel insight into the effectiveness of BN and several existing alternative nor-
malization approaches in the literature.

1.1 CONTRIBUTIONS

Foremost, we draw an unexpected connection between least squares and the gradient computation
of BN. This motivates a novel view that complements earlier investigations into why BN is so
effective. Our view is consistent with recent empirical surprises regarding ordering of layers within
ResNet residual maps (He et al., 2016b) and within shake-shake regularization branches (Huang &
Narayanan, 2018). Finally, to demonstrate the extensibility of our view, we motivate and evaluate
two variants of BN from the perspective of gradient-least-squares. In the first variant, a least squares
explanation motivates the serial chaining of BN and Layer Normalization (LN) (Ba et al., 2016). In
the second variant, regularization of the least-squares leads to a version of BN that performs better
on batch size two. In both variants, we provide empirical support on CIFAR-10.
In summary, our work presents a view, which we term gradient-least-squares, through which the
back-propagation of BN and related work in a neural network can be recast as least squares regres-
sion. This regression decomposes gradients into an explained portion and a residual portion; BN
back-propagation will be shown to remove the explained portion. Hopefully, gradient-least-squares
will be broadly useful in the future design and understanding of neural network components. Figure
1 reviews normalization with batch statistics, and illustrates our main theorem.


[Figure 1: two panels. Left panel, "Gradient-least-squares relates quantities shown in hexagons"; right panel, "BN back-propagation performs gradient-least-squares".]

Figure 1: The left figure reviews, for a single channel at a particular layer within a single batch, notable quantities computed during the forward pass and during back-propagation of BN. Let $\{x_i\}_{i=1 \ldots N}$ be activations. Let $\mu = \sum_{i=1}^{N} \frac{x_i}{N}$ and $\sigma^2 = \sum_{i=1}^{N} \frac{(x_i - \mu)^2}{N}$. Let $L$ be a function dependent on the normalized activations $z_i$ defined for each $j$ by $z_j = \frac{x_j - \mu}{\sigma}$. These, along with partial derivatives, are shown in the left figure. Our work establishes a novel identity on the quantities shown in hexagons. The right figure illustrates our main result in a scatter plot, in which each pair $\left(z_i, \frac{\partial L}{\partial z_i}\right)$ is shown as a data point in the regression.

2 NORMALIZATION GRADIENTS ARE LEAST-SQUARES RESIDUALS

Consider any particular channel within which {xi } are activations to be normalized in BN moment
calculations. Ioffe & Szegedy (2015) defined BN as

$$\mathrm{BN}(x) = \frac{x - \mu}{\sigma} \cdot c + b \qquad (1)$$
where σ, µ are batch moments, but b and c are learned per-channel parameters persistent across
batches. In BN, the batch dimension and spatial dimensions are marginalized out in the computation
of batch moments. For clarity, we consider a simplified version of BN. We ignore the variables
b and c in equation 1 responsible for a downstream channel-wise affine transformation. Ignoring
b and c is done without loss of generality, since the main observation in this work will focus on
the Gaussian normalization and remains agnostic to downstream computations. We also ignore a
numerical stability hyperparameter $\epsilon$.
We examine back-propagation of partial derivatives through this normalization, where µ and σ are
viewed as functions of x. Notably, µ and σ are functions of each xi , and thus the division by σ is
not affine. We write the normalized output as

$$z = \frac{x - \mu}{\sigma} \qquad (2)$$

We review ordinary least squares of a single variable with intercept (Friedman et al., 2001).
Let $g_j = \alpha + \beta z_j + \epsilon_j$, where $\alpha$ and $\beta$ are parameters, and $z$ and $g$ are observations; $z_j$ and $g_j$ are entries in $z$ and $g$ respectively, and the $\epsilon_j$ are i.i.d. Gaussian residuals. We wish to fit $\alpha$ and $\beta$:

$$\hat{\alpha}, \hat{\beta} = \arg\min_{\alpha, \beta} E_j\left(\|g - \alpha - \beta z\|^2\right) \qquad (3)$$

The least-squares problem in equation 3 is satisfied by $\hat{\beta} = \frac{\mathrm{Cov}(z, g)}{\mathrm{Var}(z)}$ and $\hat{\alpha} = E(g) - \hat{\beta} E(z)$.


When z are normalized activations and g are partial derivatives, then E(z) = 0 and Var(z) = 1. In
this special case, the solution simplifies into

$$\hat{\beta} = \mathrm{Cov}(z, g) \qquad (4)$$

$$\hat{\alpha} = E(g) \qquad (5)$$
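This simplification can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the data are synthetic and only illustrate that a general with-intercept fit agrees with the closed forms above once z is standardized.

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.normal(size=100)
    z = (z - z.mean()) / z.std()              # enforce mean 0, variance 1
    g = 0.7 * z + 0.3 + rng.normal(scale=0.1, size=100)

    beta_full, alpha_full = np.polyfit(z, g, deg=1)   # general OLS with intercept
    print(np.isclose(alpha_full, g.mean()))           # True: alpha_hat = E(g)
    print(np.isclose(beta_full, (g * z).mean()))      # True: beta_hat = Cov(z, g)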
Theorem 1 (Normalization gradients are least-squares residuals). Let $i \in \{1 \ldots N\}$ be indices over some set of activations $\{x_i\}$. Then the moment statistics are defined by $\mu = \sum_{i=1}^{N} \frac{x_i}{N}$ and $\sigma^2 = \sum_{i=1}^{N} \frac{(x_i - \mu)^2}{N}$. Let $L$ be a function dependent on the normalized activations $z_i$ defined for each $j$ by $z_j = \frac{x_j - \mu}{\sigma}$. Then, the gradients of $L$ satisfy, for all $j \in \{1, \ldots, N\}$, the following:

$$\sigma \frac{\partial L}{\partial x_j} = \frac{\partial L}{\partial z_j} - \widehat{\frac{\partial L}{\partial z_j}} \qquad (6)$$

where

$$\widehat{\frac{\partial L}{\partial z_j}} = \hat{\alpha} + \hat{\beta} z_j \qquad (7)$$

$$\hat{\alpha}, \hat{\beta} = \arg\min_{\alpha, \beta} \sum_{i=1}^{N} \left(\frac{\partial L}{\partial z_i} - \alpha - \beta z_i\right)^2 \qquad (8)$$

Proof: Normalization gradients are least-squares residuals. The proof involves a derivation of partial derivatives by repeated applications of the chain rule and rules of total derivative. Because $\{z_i\}$ normalized over $i$ has mean 0 and variance 1, the partial derivatives can be rearranged to satisfy the single variable ordinary least squares framework.

Fix $j$. We expand $\frac{\partial L}{\partial x_j}$ as a linear combination of $\left\{\frac{\partial L}{\partial z_i}\right\}_{i=1 \ldots N}$:

$$\frac{\partial L}{\partial x_j} = \frac{\partial L}{\partial z_j} \frac{\partial z_j}{\partial x_j} + \sum_{i \neq j} \frac{\partial L}{\partial z_i} \frac{\partial z_i}{\partial x_j} \qquad (9)$$

We state $\frac{\partial z_i}{\partial x_j}$ directly. Steps are in Appendix A under Lemma 1.

$$\frac{\partial z_i}{\partial x_j} = \begin{cases} \dfrac{-1 - z_j z_i}{\sigma N} & \text{if } i \neq j \\[2mm] \dfrac{N - 1 - z_j^2}{\sigma N} & \text{if } i = j \end{cases} \qquad (10)$$

Through substitution of equation 10 into 9, we get

$$\frac{\partial L}{\partial x_j} = \frac{\partial L}{\partial z_j} \cdot \frac{N - 1 - z_j^2}{\sigma N} + \sum_{i \neq j} \frac{\partial L}{\partial z_i} \cdot \frac{-1 - z_j z_i}{\sigma N} \qquad (11)$$

$$\sigma \frac{\partial L}{\partial x_j} = \frac{\partial L}{\partial z_j} + \frac{1}{N} \sum_{i=1}^{N} \frac{\partial L}{\partial z_i} (-1 - z_i z_j) \qquad (12)$$

$$\sigma \frac{\partial L}{\partial x_j} = \frac{\partial L}{\partial z_j} - \left(\frac{1}{N} \sum_{i=1}^{N} \frac{\partial L}{\partial z_i}\right) - z_j \left(\frac{1}{N} \sum_{i=1}^{N} \frac{\partial L}{\partial z_i} z_i\right) \qquad (13)$$


Noting that $\{z_i\}$ normalized over $i$ has mean 0 and variance 1, we recover $\hat{\beta}$ and $\hat{\alpha}$, in the sense of equations 4 and 5, from equation 13.

$$\frac{1}{N} \sum_{i=1}^{N} \frac{\partial L}{\partial z_i} z_i = \mathrm{Cov}_i\left(z_i, \frac{\partial L}{\partial z_i}\right) = \hat{\beta} \qquad (14)$$

$$\frac{1}{N} \sum_{i=1}^{N} \frac{\partial L}{\partial z_i} = E_i\left(\frac{\partial L}{\partial z_i}\right) - \hat{\beta} \cdot 0 = \hat{\alpha} \qquad (15)$$

Finally, we rearrange equations 15 and 14 into 13 to conclude, as desired,

$$\sigma \frac{\partial L}{\partial x_j} = \frac{\partial L}{\partial z_j} - \hat{\alpha} - \hat{\beta} z_j = \frac{\partial L}{\partial z_j} - \widehat{\frac{\partial L}{\partial z_j}} \qquad (16)$$

During back-propagation of a single batch, the normalization function takes in partial derivatives $\frac{\partial L}{\partial z_{(\cdot)}}$ and removes that which can be explained by the least squares fit of $\frac{\partial L}{\partial z_{(\cdot)}}$ against $z_{(\cdot)}$. As illustrated in Figure 1, during back-propagation, the residual is then divided by $\sigma$ to become $\frac{\partial L}{\partial x_{(\cdot)}}$, the gradient for the unnormalized activations.
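The identity in Theorem 1 can also be verified numerically. Below is a minimal sketch, assuming NumPy; the loss L = sum(sin(z)) is an arbitrary stand-in for downstream layers, the helper names are illustrative, and the gradient of L with respect to x is estimated by finite differences.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 8
    x = rng.normal(size=N)

    def normalize(x):
        mu = x.mean()
        sigma = np.sqrt(((x - mu) ** 2).mean())
        return (x - mu) / sigma, sigma

    def loss(x):
        z, _ = normalize(x)
        return np.sin(z).sum()                 # arbitrary function of z

    z, sigma = normalize(x)
    g = np.cos(z)                               # dL/dz for L = sum(sin(z))

    # Least squares of dL/dz against z; E(z) = 0 and Var(z) = 1, so the fit
    # reduces to the simple forms of equations 4 and 5.
    alpha_hat, beta_hat = g.mean(), (g * z).mean()
    residual = g - alpha_hat - beta_hat * z

    # Finite-difference estimate of dL/dx
    eps = 1e-6
    fd_grad = np.array([
        (loss(x + eps * np.eye(N)[j]) - loss(x - eps * np.eye(N)[j])) / (2 * eps)
        for j in range(N)
    ])

    print(np.max(np.abs(fd_grad - residual / sigma)))   # ~1e-9: Theorem 1 holds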

3 RELATED DEEP LEARNING COMPONENTS VIEWED AS GRADIENT CALCULATIONS

BN aims to control its output to have mean near 0 and variance near 1, normalized over the dataset;
this is related to the original explanation termed internal covariate shift (Ioffe & Szegedy, 2015).
Most existing work that improves or re-purposes BN has focused on describing the distribution of
activations.
Definition 1. In the context of normalization layers inside a neural network, activations are split
into partitions, within which means and variances are computed. We refer to these partitions as
normalization partitions.
Definition 2. Within the context of a normalization partition, we refer to the moments calculated on
the partitions as partition statistics.

Theorem 1 shows that BN has least squares fitting built into the gradient computation. Gradients
of the activations being normalized in each batch moment calculation are fit with a single-variable
with-intercept least squares model, and only a rescaled residual is kept during back-propagation.
We emphasize that the data on which the regression is trained and applied is a subset of empirical
activations within a batch, corresponding to the normalization partitions of BN.
To show extensibility, we recast several popular normalization techniques into the gradient-least-
squares view. We refer to activations arising from a single member of a particular batch as an item.
BHWC refers to dimensions corresponding to items, height, width, and channels respectively. In
non-image applications or fully connected layers, H and W are 1. BN marginalizes out the items
and spatial dimensions, but statistics for each channel are kept separate.
In the subsequent sections, we revisit several normalization methods from the perspective of the gra-
dient. Figure 2 reviews the normalization partitions of these methods, and places our main theorem
about gradient-least-squares into context.

3.1 LAYER NORMALIZATION, INSTANCE NORMALIZATION, GROUP NORMALIZATION

Ba et al. (2016) introduced Layer Normalization (LN) in the context of large LSTM models and
recurrent networks. Only the (H, W, C) dimensions are marginalized in LN, whereas BN marginal-
izes out the (B, H, W ) dimensions. In our regression framework, the distinction can be understood


[Figure 2 diagram: for Batch Normalization, Layer Normalization, Group Normalization, and Instance Normalization, activations are drawn along the batch, channel, height, and width axes, and each normalization partition performs a separate gradient-least-squares; one partition per method is highlighted.]

Figure 2: We review the normalization partitions of BN, LN, GN, and IN. Each normalization
partition contains a separate set of data points on which the gradient regression is performed. One
partition for each method is illustrated in blue. This figure also shows the correspondence between
a single activation and a gradient regression data point for BN.

as changing the data point partitions in which least squares are fit during back-propagation. LN
marginalizes out the channels, but computes separate statistics for each batch item. To summarize,
the regression setup in the back-propagation of LN is performed against other channels, rather than
against other batch items.
Huang & Belongie (2017) introduced Instance Normalization (IN) in the context of transferring
styles across images. IN is closely related to contrast normalization, an older technique used in
image processing. IN emphasizes end-to-end training with derivatives passing through the moments.
Only the (H, W ) dimensions are marginalized in IN, whereas BN marginalizes (B, H, W ) dimen-
sions. In our framework, this can be understood as using fewer data points and a finer binning to fit
the least squares during back-propagation, as each batch item now falls into its own normalization
partition.
Wu & He (2018) introduced Group Normalization (GN) to improve performance on image-related
tasks when memory constrains the batch size. Similar to LN, GN also marginalizes out the
(H, W, C) dimensions in the moment computations. The partitions of GN are finer: the channels are
grouped into disjoint sub-partitions, and the moments are computed for each sub-partition. When
the number of groups is one, GN reduces to LN.
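To make the partitions concrete, the following sketch assumes NumPy and a (B, H, W, C) activation layout; it lists the axes marginalized out by each method, and the group count for GN is an illustrative choice. Each entry of the resulting statistics corresponds to one normalization partition, and hence to one gradient regression during back-propagation.

    import numpy as np

    x = np.random.default_rng(0).normal(size=(128, 8, 8, 64))    # (B, H, W, C)

    bn_mean = x.mean(axis=(0, 1, 2))      # BN: one statistic per channel
    ln_mean = x.mean(axis=(1, 2, 3))      # LN: one statistic per batch item
    in_mean = x.mean(axis=(1, 2))         # IN: one per (item, channel) pair

    groups = 4                             # GN: channels split into sub-partitions
    gn_mean = x.reshape(128, 8, 8, groups, 64 // groups).mean(axis=(1, 2, 4))

    print(bn_mean.shape, ln_mean.shape, in_mean.shape, gn_mean.shape)
    # (64,) (128,) (128, 64) (128, 4)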
Future normalization methods may normalize with respect to other normalization partitions; such methods can pattern-match with BN, LN, IN, or GN, and their back-propagation can be formulated as a least-squares fit, in which the partial derivatives $\frac{\partial L}{\partial z_{(\cdot)}}$ at normalized activations are fitted against the normalized $z_{(\cdot)}$, and the residual of the fit is rescaled to become $\frac{\partial L}{\partial x_{(\cdot)}}$. Figure 2 summarizes the normalization partitions for BN, LN, IN, and GN; the figure visualizes, as an example, a one-to-one correspondence between an activation in BN and a data point in the gradient regression.
Theorem 1 is agnostic to the precise nature of how activations are partitioned before being nor-
malized; thus, equation 9 applies directly to any method that partitions activations and performs
Gaussian normalization on each partition. The partitionings of BN, LN, IN, and GN are performed in different respective manners, and each partition is individually subject to Gaussian normalization.
Thus, the gradients of BN, LN, IN, and GN are residuals of regressions in the sense of Theorem 1.

3.2 WEIGHT NORMALIZATION

Salimans & Kingma (2016) introduced Weight Normalization (WN) in LSTMs, and noted improvements in the condition number of deep networks; WN divides each weight tensor by its vector 2-norm. In the view of gradient-least-squares, WN has a single-variable intercept-0 regression interpretation in back-propagation, analogous to BN. A raw weight vector $v$ is normalized and scaled before being used as coefficient weights $w = \frac{c}{\|v\|} v$, where $c$ is a learned downstream linear scaling parameter.

In this regression setup, the length-normalized weights of WN are analogous to the Gaussian-normalized activations in BN. We write $z = \frac{v}{\|v\|} = \frac{w}{c}$, and state directly an analogous relationship between each $\|v\| \frac{\partial L}{\partial v_j}$ and the regression on $\left(z_i, \frac{\partial L}{\partial z_i}\right)_{i=1 \ldots N}$. See Appendix B, Lemma 2, for steps that derive the following identity: for loss $L$ and for each component $j$, we have

$$\|v\| \frac{\partial L}{\partial v_j} = \frac{\partial L}{\partial z_j} - \hat{\beta} z_j \qquad (17)$$

where

$$\hat{\beta} = \arg\min_{\beta} \|\nabla_z L - \beta z\|^2 = (\nabla_z L)^T z \qquad (18)$$

The L2 normalization of weights in WN appears distinct from the Gaussian normalization of activa-
tions in BN; nevertheless, WN can also be recast as a least squares regression.
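Equation 17 can likewise be checked numerically. The sketch below assumes NumPy; the loss L = sum(z^3) is an arbitrary stand-in, and the gradient with respect to v is estimated by finite differences.

    import numpy as np

    rng = np.random.default_rng(1)
    v = rng.normal(size=5)

    def loss(v):
        z = v / np.linalg.norm(v)
        return (z ** 3).sum()                  # arbitrary function of z

    z = v / np.linalg.norm(v)
    g = 3 * z ** 2                              # dL/dz for L = sum(z^3)
    beta_hat = g @ z                            # intercept-0 fit; ||z|| = 1

    eps = 1e-6
    fd_grad = np.array([
        (loss(v + eps * np.eye(5)[j]) - loss(v - eps * np.eye(5)[j])) / (2 * eps)
        for j in range(5)
    ])

    # ||v|| * dL/dv should equal dL/dz minus the explained portion beta_hat * z
    print(np.max(np.abs(np.linalg.norm(v) * fd_grad - (g - beta_hat * z))))   # ~1e-9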

3.3 IDENTITY MAPPINGS IN RESNET, AND SHAKE-SHAKE RESNEXT REGULARIZATION

[Figure 3 diagram: the improved residual mapping with BN first (trunk -> BN -> ReLU -> Conv -> BN -> ReLU -> Conv -> addition), in which gradients returning to the trunk are least-squares residuals, versus the original ResNet residual mapping (trunk -> Conv -> BN -> ReLU -> Conv -> BN -> addition -> ReLU).]

Figure 3: This figure illustrates the original (He et al., 2016a) and improved (He et al., 2016b)
residual mappings in ResNets. Arrows point in the direction of the forward pass. Dotted lines
indicate that gradients are zero-centered and decorrelated with respect to downstream activations in
the residual mapping. The improved ordering has BN coming first, and thus constrains that gradients
of the residual map must be decorrelated with respect to some normalized activations inside the
residual mapping.

An update to the popular ResNet architecture showed that the network’s residual mappings can
be dramatically improved with a new ordering (He et al., 2016b). The improvement moved BN
operations into early positions and surprised the authors; we support the change from the perspective
of gradient-least-squares. Figure 3 reviews the precise ordering in the two versions. Huang &
Narayanan (2018) provides independent empirical support for the BN-early order, in shake-shake
regularization (Gastaldi, 2017) architectures. We believe that the surprise arises from a perspective
that views BN only as a way to control the distribution of activations; one would place BN after
a sequence of convolution layers. In the gradient-least-squares perspective, the first layer of each
residual mapping is also the final calculation for these gradients before they are added back into the main trunk. The improved residual branch constrains the gradients returning from the residual mappings to be zero-centered and decorrelated with respect to some activations inside the branch. We illustrate this idea in Figure 3.

Table 1: BN plus LN final validation performance (ResNet-34-v2, batch size 128)

Normalization                    CIFAR-10 Accuracy    CIFAR-10 Cross Entropy
BN, LN                           0.9259               0.3087
LN, BN                           0.9245               0.3389
BN (Ioffe & Szegedy, 2015)       0.9209               0.3969
LN (Ba et al., 2016)             0.9102               0.3548

4 NORMALIZATION APPROACHES MOTIVATED BY LEAST SQUARES

Gradient-least-squares views back-propagation in deep neural networks as a solution to a regression problem. Thus, formulations and ideas from a regression perspective would motivate improvements and alternatives to BN. We pursue and evaluate two of these ideas.

4.1 BN AND LN AS TWO-STEP GRADIENT REGRESSION

BN and LN are similar to each other, but they normalize over different partitioning of the activations;
in back-propagation, the regressions occur respectively with respect to different partitions of the
activations. Suppose that a BN and a LN layer are chained serially in either order. This results in a
two-step regression during back-propagation; in the reverse of the forward order, the residual from the first regression
is further explained by a second regression on a different partitioning. In principle, whether this
helps would depend on the empirical characteristics of the gradients encountered during training.
The second regression could further decorrelate partial gradients from activations. Empirically, we
show improvement in a reference ResNet-34-v2 implementation on CIFAR-10 relative to BN with
batch size 128. In all cases, only a single per-channel downstream affine transformation is applied,
after both normalization layers, for consistency in the number of parameters. See table 1 for CIFAR-
10 validation performances. We kept all default hyperparameters from the reference implementation:
learning schedules, batch sizes, and optimizer settings.
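As a concrete illustration of the serial chaining, the following is a minimal sketch assuming PyTorch; the reference implementation is not reproduced here, so the module below is only indicative. GroupNorm with a single group plays the role of LN over (C, H, W), and only the second layer carries per-channel affine parameters, matching the single downstream affine used for Table 1.

    import torch
    import torch.nn as nn

    def bn_then_ln(num_channels: int) -> nn.Sequential:
        return nn.Sequential(
            nn.BatchNorm2d(num_channels, affine=False),   # regression over (B, H, W)
            nn.GroupNorm(1, num_channels, affine=True),   # LN: regression over (C, H, W)
        )

    x = torch.randn(128, 64, 8, 8)            # (B, C, H, W)
    y = bn_then_ln(64)(x)
    print(y.shape)                             # torch.Size([128, 64, 8, 8])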

4.2 ADDRESSING SMALL BATCHES WITH LEAST-SQUARES REGULARIZATION

BN performs less well on small batches (Ioffe, 2017). Gradient-least-squares interprets this as
gradient regressions failing on correlated data, an issue typically addressed by regularization. We
pursue this idea to recover some performance on small batches by use of regularization. Our reg-
ularization uses streaming estimates of past gradients to create virtual data in the regression. This
performed better than standard BN on the same batch size, but we did not recover the performance
of large batches; this is consistent with the idea that regularization could not in general compensate
for having much less data. See Appendix C for CIFAR-10 validation performances.

5 LIMITATIONS AND RELATED WORK

5.1 SWITCH NORMALIZATION

Luo et al. (2018a) introduced Switch Normalization (SwN), a hybrid strategy for combining moment calculations from LN, BN, and IN. SwN uses learnable scalar logits $\lambda_k$ for $k \in \Omega = \{BN, IN, LN\}$ with corresponding softmax weighting activations $w_k = \frac{\exp(\lambda_k)}{\sum_{k'} \exp(\lambda_{k'})}$ to rescale the contributions to the batch mean for each normalization scheme. It uses an analogous set of parameters $\lambda'_k$ and activations $w'_k$ for variances. We sketch the back-propagation of a simplified version of SwN in the perspective of gradient-least-squares. We ignore both the $\epsilon$ in the division and the downstream affine $z \rightarrow c \cdot z + b$. The normalization calculation inside SwN can be written as:

$$z_{bhwc} = \frac{x_{bhwc} - \sum_{k \in \Omega} w_k \mu_{bhwc,k}}{\sqrt{\sum_{k \in \Omega} w'_k \sigma^2_{bhwc,k}}} \qquad (19)$$

where $\Omega = \{BN, LN, IN\}$. There is potentially a unique mean and variance used for each activation. Equation 19 bears similarities to the setup in Theorem 1, but we leave unresolved whether there is a gradient-least-squares regression interpretation for SwN.

5.2 DECORRELATED BATCH NORMALIZATION AND SPECTRAL NORMALIZATION

Decorrelated Batch Normalization (DBN) (Huang et al., 2018) is a generalization of BN that per-
forms Mahalanobis ZCA whitening to decorrelate the channels, using differentiable operations. On
some level, the matrix gradient equation resembles the least squares formulation in Theorem 1.
Spectral Normalization (SpN) (Miyato et al., 2018) is an approximate spectral generalization of WN.
For DBN and SpN, the regression interpretations remain unresolved.

5.3 RELATED WORK

BN has been instrumental in the training of deeper networks (Ioffe & Szegedy, 2015). Subsequent
work resulted in Batch Renormalization (Ioffe, 2017), and further emphasized the importance of
passing gradients through the minibatch moments, instead of a gradient-free exponential running
average. In gradient-least-squares, use of running accumulators in the training forward pass would
stop the gradients from flowing through them during training, and there would be no least-squares.
He et al. (2016b) demonstrate empirically the unexpected advantages of placing BN early in residual
mappings of ResNet.
Santurkar et al. (2018) showed that BN makes the loss landscape smoother, and gradients more
predictable across stochastic gradient descent steps. Balduzzi et al. (2017) found evidence that
spatial correlation of gradients explains why ResNet outperforms earlier designs of deep neural
networks. Kohler et al. (2018) proved that BN accelerates convergence on least squares loss, but did
not consider back-propagation of BN as a least squares residual. Luo et al. (2018b) recast BN
as a stochastic process, resulting in a novel treatment of regularization.

6 DISCUSSION, AND FUTURE WORK

This work makes explicit how BN back-propagation regresses partial derivatives against the normal-
ized activations and keeps the residual. This view, in conjunction with the empirical success of BN,
suggests an interpretation of BN as a gradient regression calculation. BN and its variants decorrelate
and zero-center the gradients with respect to the normalized activations. Subjectively, this can be
viewed as removing systematic errors from the gradients. Our view also supports empirical results in
literature preferring early BN placement within neural network branches.
Leveraging gradient-least-squares considerations, we ran two sets of normalization experiments,
applicable to large batch and small batch settings. Placing a LN layer either before or after BN can
be viewed as two-step regression that better explains the residual. We show empirically on CIFAR-
10 that BN and LN together are better than either individually. In a second set of experiments, we
address BN’s performance degradation with small batch size. We regularize the gradient regres-
sion with streaming gradient statistics, which empirically recovers some performance on CIFAR-10
relative to basic BN, on batch size two.
Why do empirical improvements in neural networks with BN keep the gradient-least-squares residu-
als and drop the explained portion? We propose two open approaches for investigating this in future
work. A first approach focuses on how changes to the gradient regression result in different formu-
lations; the two empirical experiments in our work contribute to this. A second approach examines
the empirical relationships between gradients of activations evaluated on the same parameter val-
ues; we can search for a shared noisy component arising from gradients in the same normalization
partition. Suppose that the gradient noise correlates with the activations – this is plausible because the population of internal activations arises from using shared weights – then normalizations could
be viewed as a layer that removes systematic noise during back-propagation.
In conclusion, we have presented a novel view that reorganizes the back-propagation of BN as a least
squares residual calculation. This view generates novel descriptions of normalization techniques
related to BN, and comments on the ordering of layers inside the residual mappings of ResNet. This
view is extensible and will motivate novel designs of neural network components in future work.

REFERENCES
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint
arXiv:1607.06450, 2016.
David Balduzzi, Marcus Frean, Lennox Leary, J. P. Lewis, Kurt Wan-Duo Ma, and Brian
McWilliams. The shattered gradients problem: If resnets are the answer, then what is the ques-
tion? In Proceedings of the 34th International Conference on Machine Learning, ICML 2017,
Sydney, NSW, Australia, 6-11 August 2017, pp. 342–350, 2017.
Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, vol-
ume 1. Springer series in statistics New York, NY, USA:, 2001.
Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog-
nition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.
770–778, 2016a.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual
networks. In European conference on computer vision, pp. 630–645. Springer, 2016b.
Che-Wei Huang and Shrikanth S Narayanan. Normalization before shaking toward learning sym-
metrically distributed representation without margin in speech emotion recognition. arXiv
preprint arXiv:1808.00876, 2018.
Lei Huang, Dawei Yang, Bo Lang, and Jia Deng. Decorrelated batch normalization. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 791–800, 2018.
Xun Huang and Serge J Belongie. Arbitrary style transfer in real-time with adaptive instance nor-
malization. In ICCV, pp. 1510–1519, 2017.
Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized
models. In Advances in Neural Information Processing Systems, pp. 1945–1953, 2017.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
Jonas Kohler, Hadi Daneshmand, Aurelien Lucchi, Ming Zhou, Klaus Neymeyr, and Thomas
Hofmann. Towards a theoretical understanding of batch normalization. arXiv preprint
arXiv:1805.10694, 2018.
Zachary C Lipton and Jacob Steinhardt. Troubling trends in machine learning scholarship. arXiv
preprint arXiv:1807.03341, 2018.
Ping Luo, Jiamin Ren, and Zhanglin Peng. Differentiable learning-to-normalize via switchable
normalization. arXiv preprint arXiv:1806.10779, 2018a.
Ping Luo, Xinjiang Wang, Wenqi Shao, and Zhanglin Peng. Understanding regularization in batch
normalization. arXiv preprint arXiv:1809.00846, 2018b.
Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization
for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accel-
erate training of deep neural networks. In Advances in Neural Information Processing Systems,
pp. 901–909, 2016.


Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? (No, it is not about internal covariate shift). arXiv preprint arXiv:1805.11604, 2018.

Yuxin Wu and Kaiming He. Group normalization. In The European Conference on Computer Vision
(ECCV), September 2018.

A PARTIAL DERIVATIVES OF THE NORMALIZATION FUNCTION

Lemma 1. Consider the Gaussian normalization function that maps all $\{x_i\}$ into corresponding $\{z_i\}$, for $i \in 1 \ldots N$. We define $\mu = \frac{1}{N} \sum_{j=1}^{N} x_j$ and $\sigma^2 = \frac{1}{N} \sum_{j=1}^{N} (x_j - \mu)^2$ with $\sigma > 0$. The normalized activations $z_i$ are defined as $z_i = \frac{x_i - \mu}{\sigma}$. Then, the partial derivatives satisfy

$$\frac{\partial z_j}{\partial x_i} = \begin{cases} \dfrac{-1 - z_i z_j}{\sigma N} & \text{if } j \neq i \\[2mm] \dfrac{N - 1 - z_i^2}{\sigma N} & \text{if } j = i \end{cases} \qquad (20)$$

Proof. In deriving $\frac{\partial z_j}{\partial x_i}$, we will treat the cases of $j \neq i$ and $j = i$ separately. We start by examining intermediate quantities of interest as a matter of convenience for later use. We define helper quantities $u_i = x_i - \mu$. Note that each $u_j$ depends on all of the $x_i$ via $\mu$. Next, we write out useful identities:

$$\frac{1}{N} \sum_{j=1}^{N} u_j = 0 \qquad (21)$$

$$\frac{\partial \mu}{\partial x_i} = \frac{1}{N} \qquad (22)$$

$$\frac{\partial u_i}{\partial x_j} = \begin{cases} \dfrac{-1}{N} & \text{if } i \neq j \\[2mm] \dfrac{N - 1}{N} & \text{if } i = j \end{cases} \qquad (23)$$

We prepare to differentiate with the rule of total derivative:

$$\sigma^2 = \frac{1}{N} \sum_{j=1}^{N} u_j^2 \qquad (24)$$

$$\frac{\partial(\sigma^2)}{\partial x_i} = \frac{1}{N} \sum_{j=1}^{N} \frac{\partial(u_j^2)}{\partial u_j} \frac{\partial u_j}{\partial x_i} \qquad (25)$$

Making use of equations 21, 22, 23 and 25, we simplify $\frac{\partial \sigma}{\partial x_i}$ for any $i$ as follows.

$$\frac{\partial \sigma}{\partial x_i} = \frac{\partial \sigma}{\partial(\sigma^2)} \frac{\partial(\sigma^2)}{\partial x_i} \qquad (26)$$

$$= \frac{\partial \sigma}{\partial(\sigma^2)} \cdot \frac{1}{N} \sum_{j=1}^{N} \frac{\partial(u_j^2)}{\partial u_j} \frac{\partial u_j}{\partial x_i} \qquad (27)$$

$$= \frac{1}{2N} (\sigma^2)^{-\frac{1}{2}} \sum_{j=1}^{N} (2 u_j) \frac{\partial u_j}{\partial x_i} \qquad (28)$$

$$= \frac{1}{2N\sigma} \left( \sum_{j \neq i} (2 u_j) \frac{\partial u_j}{\partial x_i} + 2 u_i \frac{\partial u_i}{\partial x_i} \right) \qquad (29)$$

$$= \frac{1}{N\sigma} \left( \sum_{j \neq i} u_j \cdot \frac{-1}{N} + u_i \cdot \frac{N - 1}{N} \right) \qquad (30)$$

$$= \frac{1}{N\sigma} \left( \underbrace{\sum_{j=1}^{N} u_j \cdot \frac{-1}{N}}_{= 0} + u_i \right) \qquad (31)$$

$$= \frac{1}{N\sigma} u_i \qquad (32)$$

$$\frac{\partial \sigma}{\partial x_i} = \frac{x_i - \mu}{\sigma N} \qquad (33)$$

We apply the quotient rule on $\frac{\partial z_j}{\partial x_i}$ when $j \neq i$, then substitute equation 33:

$$\frac{\partial z_j}{\partial x_i} = \frac{\partial}{\partial x_i} \left( \frac{x_j - \mu}{\sigma} \right) \qquad (34)$$

$$= \frac{-\sigma \frac{\partial \mu}{\partial x_i} - (x_j - \mu) \frac{\partial \sigma}{\partial x_i}}{\sigma^2} \qquad (35)$$

$$= \frac{-\frac{\sigma}{N} - (x_j - \mu) \frac{x_i - \mu}{\sigma N}}{\sigma^2} \qquad (36)$$

$$= \frac{-1 - (x_j - \mu) \frac{x_i - \mu}{\sigma^2}}{N\sigma} \qquad (37)$$

$$\frac{\partial z_j}{\partial x_i} = \frac{-1 - z_i z_j}{N\sigma} \qquad (38)$$

Similarly, when $i = j$,

$$\frac{\partial z_i}{\partial x_i} = \frac{\partial}{\partial x_i} \left( \frac{x_i - \mu}{\sigma} \right) \qquad (39)$$

$$= \frac{\sigma \left(1 - \frac{\partial \mu}{\partial x_i}\right) - (x_i - \mu) \frac{\partial \sigma}{\partial x_i}}{\sigma^2} \qquad (40)$$

$$= \frac{\sigma - \frac{\sigma}{N} - (x_i - \mu) \frac{x_i - \mu}{\sigma N}}{\sigma^2} \qquad (41)$$

$$= \frac{N - 1 - (x_i - \mu) \frac{x_i - \mu}{\sigma^2}}{N\sigma} \qquad (42)$$

$$\frac{\partial z_i}{\partial x_i} = \frac{N - 1 - z_i^2}{N\sigma} \qquad (43)$$
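Equation 20 can also be checked against a finite-difference Jacobian. The sketch below assumes NumPy, and the helper name is illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 6
    x = rng.normal(size=N)

    def normalize(x):
        mu = x.mean()
        sigma = np.sqrt(((x - mu) ** 2).mean())
        return (x - mu) / sigma, sigma

    z, sigma = normalize(x)

    # Closed form from Lemma 1: entry (j, i) holds dz_j / dx_i
    closed = (-1.0 - np.outer(z, z)) / (N * sigma) + np.eye(N) / sigma

    # Finite-difference Jacobian, column i = dz / dx_i
    eps = 1e-6
    fd = np.zeros((N, N))
    for i in range(N):
        zp, _ = normalize(x + eps * np.eye(N)[i])
        zm, _ = normalize(x - eps * np.eye(N)[i])
        fd[:, i] = (zp - zm) / (2 * eps)

    print(np.max(np.abs(fd - closed)))   # ~1e-9: equation 20 holds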


B WEIGHT NORMALIZATION RECAST AS GRADIENT REGRESSION


We show steps to recast the gradient of WN as regression. In WN, A raw weight vector v is normal-
ized and scaled and before being used as coefficient weights. Salimans & Kingma (2016) introduced
c
their transformation as w = v, where c is a learned downstream linear scaling parameter. In
kvk
v w
our regression setup, we ignore c. We define z = = . We derive the analogous relationship
 kvk c
∂L ∂L
between each kvk and the regression on zi , .
∂vj ∂zi i=1...N
2
Note that intercept-0 single variable least squares β̂ = arg min k∇z L − βzk has the solution
β
(∇z L)T z
β̂ = = (∇z L)T z, since kzk = 1.
kzk2
v
Lemma 2. Let v ∈ RN be weights, and let z = . Let L be a loss function dependent on z.
kvk
Then, for each component j, we have
∂L ∂L
kvk = − β̂zj (44)
∂vj ∂zj
2
where β̂ = arg min k∇z L − βzk = (∇z L)T z.
β

Proof. Salimans & Kingma (2016) wrote their gradients as follows:


∇w L · v
∇c L = (45)
kvk
c c∇c L
∇v L = ∇w L − v (46)
kvk kvk2
v w
In our notation where z = = , we have
kvk c
∇z L = c∇w L (47)
To recover β̂ We substitute equation 45, and then subsequently equation 47 into equation 46
 
c c∇w L · v
∇v L = ∇w L − v (48)
kvk kvk3
 
1 (c∇w L · v)v
∇v L = c∇w L − (49)
kvk kvk2
(∇z L · v)v
kvk∇v L = ∇z L − (50)
kvk2
kvk∇v L = ∇z L − (∇z L · z) z (51)
| {z }
β̂

The result follows: for loss L and for each component j, we have
∂L ∂L
kvk = − β̂zj (52)
∂vj ∂zj

C ADDRESSING SMALL BATCHES WITH LEAST-SQUARES REGULARIZATION


Let $b$ be an index for different batches; let $X$ refer to data inputs into the neural network (for example, image and class label) within a single step of training, and let $X^{(b)}$ refer to the value of all data inputs in batch $b$. In our work, we keep track of exponential running estimates across batches, $\hat{\alpha}^* \sim E_b\left[E_i\left(\frac{\partial L}{\partial z_i}\big|_{X = X^{(b)}}\right)\right]$ and $\hat{\beta}^* \sim E_b\left[E_i\left(\frac{\partial L}{\partial z_i}\big|_{X = X^{(b)}} z_i\right)\right]$, that marginalize the $(B, H, W)$ dimensions into accumulators of shape $C$. The $b$ subscript of the outer expectation is slightly abusive notation indicating that $\hat{\alpha}^*$ and $\hat{\beta}^*$ are running averages across recent batches, with momentum as a hyperparameter that determines the weighting. We regularize the gradient regression with virtual activations and virtual gradients, defined as follows. We append two virtual batch items, broadcast to an appropriate shape, $x_+ = \mu_b + \sigma_b$ and $x_- = \mu_b - \sigma_b$. Here, $\mu_b$ and $\sigma_b$ are batch statistics of the real activations. The concatenated tensor undergoes standard BN, which outputs the usual $\{z_i\}$ for the real activations, but $z_+ = 1$ and $z_- = -1$ for the virtual items. The $z_+$ and $z_-$ do not affect the feed-forward calculations, but they receive virtual gradients during back-propagation:

$$\frac{\partial L}{\partial z_+} = \hat{\alpha}^* + \hat{\beta}^* \qquad (53)$$

$$\frac{\partial L}{\partial z_-} = \hat{\alpha}^* - \hat{\beta}^* \qquad (54)$$

The virtual data $\left(z_+, \frac{\partial L}{\partial z_+}\right)$ and $\left(z_-, \frac{\partial L}{\partial z_-}\right)$ regularize the gradient-least-squares regression; $\frac{\partial L}{\partial z_+}$ and $\frac{\partial L}{\partial z_-}$ eventually modify the gradients received by the real $x_i$ activations. The virtual data can be weighted with hyperparameters. In our experiments, we see improvements, robust to a hyperparameter cross-product search over the weightings and the momentum for $\hat{\alpha}^*$ and $\hat{\beta}^*$. The momenta for $\hat{\alpha}^*$ and $\hat{\beta}^*$ were in $\{.997, .5\}$ and the virtual item weights were in $\{2^{i-1}\}_{i \in \{0,1,2,3\}}$. The performance of larger batches is not recovered; regularized regression could not be reasonably expected to recover the performance of regressing with more data. See Table 2 for final validation performances with a reference TensorFlow ResNet-34-v2 implementation on a batch size of two. The baseline evaluation with identity (no normalization) experienced noticeable overfitting in terms of cross entropy but not accuracy. The base learning rate was multiplied by $\frac{1}{64}$ relative to the baseline rate used in runs with batch size 128.

Table 2: Streaming regularization is less affected by small batch sizes (ResNet-34-v2, batch size 2)

Normalization                    CIFAR-10 Accuracy    CIFAR-10 Cross Entropy
Our Best Hyperparameter          0.9091               0.3627
Our Worst Hyperparameter         0.9005               0.4086
BN (Ioffe & Szegedy, 2015)       0.8903               0.4624
Renorm (Ioffe, 2017)             0.9033               0.3823
Identity                         0.9086               0.6934 (0.4229 at best point)
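The following is a minimal sketch, assuming NumPy, of the virtual-item mechanism described above for a single channel; the function name, the unit virtual-item weights, and the momentum handling are illustrative rather than the exact implementation used in our experiments.

    import numpy as np

    def regularized_residual(z, g, alpha_star, beta_star, momentum=0.997):
        """Return sigma * dL/dx for the real items, plus updated streaming estimates."""
        # two virtual items at z = +1 and z = -1 receive the streaming gradients
        z_aug = np.concatenate([z, [1.0, -1.0]])
        g_aug = np.concatenate([g, [alpha_star + beta_star, alpha_star - beta_star]])
        # z_aug keeps mean 0 and variance 1 when z comes from BN, so the
        # with-intercept fit reduces to the simple forms of equations 4 and 5
        alpha_hat = g_aug.mean()
        beta_hat = (g_aug * z_aug).mean()
        # streaming estimates are updated from the real items only (an assumption here)
        alpha_star = momentum * alpha_star + (1 - momentum) * g.mean()
        beta_star = momentum * beta_star + (1 - momentum) * (g * z).mean()
        return g - alpha_hat - beta_hat * z, (alpha_star, beta_star)

    z = np.array([1.0, -1.0])       # normalized activations for a batch of two
    g = np.array([0.3, -0.9])       # incoming dL/dz for the real items
    residual, state = regularized_residual(z, g, alpha_star=0.0, beta_star=0.0)
    print(residual, state)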

