Batch Normalization

1) The document discusses batch normalization, a technique to accelerate deep network training by reducing internal covariate shift. 2) Internal covariate shift refers to the changing distribution of network activations during training as parameters change, which slows down training. 3) Batch normalization works by normalizing the layer inputs during each mini-batch, which helps keep their distributions stable and prevents layers from having to constantly adapt, allowing higher learning rates and faster convergence.

Uploaded by

Ali Hssan

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

43 views

Batch Normalization

Uploaded by

Ali Hssan

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 11

Batch Normalization: Accelerating Deep Network Training by

Reducing Internal Covariate Shift

Sergey Ioffe Christian Szegedy
Google Inc., [email protected] Google Inc., [email protected]

Abstract Using mini-batches of examples, as opposed to one exam-

ple at a time, is helpful in several ways. First, the gradient
arXiv:1502.03167v2 [cs.LG] 13 Feb 2015

Training Deep Neural Networks is complicated by the fact of the loss over a mini-batch is an estimate of the gradient
that the distribution of each layer’s inputs changes during over the training set, whose quality improves as the batch
training, as the parameters of the previous layers change. size increases. Second, computation over a batch can be
This slows down the training by requiring lower learning much more efficient than m computations for individual
rates and careful parameter initialization, and makes it no- examples, due to the parallelism afforded by the modern
toriously hard to train models with saturating nonlineari- computing platforms.
ties. We refer to this phenomenon as internal covariate While stochastic gradient is simple and effective, it
shift, and address the problem by normalizing layer in- requires careful tuning of the model hyper-parameters,
puts. Our method draws its strength from making normal- specifically the learning rate used in optimization, as well
ization a part of the model architecture and performing the as the initial values for the model parameters. The train-
normalization for each training mini-batch. Batch Nor- ing is complicated by the fact that the inputs to each layer
malization allows us to use much higher learning rates and are affected by the parameters of all preceding layers – so
be less careful about initialization. It also acts as a regu- that small changes to the network parameters amplify as
larizer, in some cases eliminating the need for Dropout. the network becomes deeper.
Applied to a state-of-the-art image classification model, The change in the distributions of layers’ inputs
Batch Normalization achieves the same accuracy with 14 presents a problem because the layers need to continu-
times fewer training steps, and beats the original model ously adapt to the new distribution. When the input dis-
by a significant margin. Using an ensemble of batch- tribution to a learning system changes, it is said to experi-
normalized networks, we improve upon the best published ence covariate shift (Shimodaira, 2000). This is typically
result on ImageNet classification: reaching 4.9% top-5 handled via domain adaptation (Jiang, 2008). However,
validation error (and 4.8% test error), exceeding the ac- the notion of covariate shift can be extended beyond the
curacy of human raters. learning system as a whole, to apply to its parts, such as a
sub-network or a layer. Consider a network computing
1 Introduction ℓ = F2 (F1 (u, Θ1 ), Θ2 )

Deep learning has dramatically advanced the state of the where F1 and F2 are arbitrary transformations, and the
art in vision, speech, and many other areas. Stochas- parameters Θ1 , Θ2 are to be learned so as to minimize
tic gradient descent (SGD) has proved to be an effec- the loss ℓ. Learning Θ2 can be viewed as if the inputs
tive way of training deep networks, and SGD variants x = F1 (u, Θ1 ) are fed into the sub-network
such as momentum (Sutskever et al., 2013) and Adagrad
(Duchi et al., 2011) have been used to achieve state of the ℓ = F2 (x, Θ2 ).
art performance. SGD optimizes the parameters Θ of the
network, so as to minimize the loss For example, a gradient descent step
m
1 X
N
α X ∂F2 (xi , Θ2 )
Θ = arg min ℓ(xi , Θ) Θ2 ← Θ2 −
Θ N m i=1 ∂Θ2
i=1

where x1...N is the training data set. With SGD, the train- (for batch size m and learning rate α) is exactly equivalent
ing proceeds in steps, and at each step we consider a mini- to that for a stand-alone network F2 with input x. There-
batch x1...m of size m. The mini-batch is used to approx- fore, the input distribution properties that make training
imate the gradient of the loss function with respect to the more efficient – such as having the same distribution be-
parameters, by computing tween the training and test data – apply to training the
1 ∂ℓ(xi , Θ) sub-network as well. As such it is advantageous for the
. distribution of x to remain fixed over time. Then, Θ2 does
m ∂Θ

1
not have to readjust to compensate for the change in the 2 Towards Reducing Internal
distribution of x.
Covariate Shift
Fixed distribution of inputs to a sub-network would We define Internal Covariate Shift as the change in the
have positive consequences for the layers outside the sub- distribution of network activations due to the change in
network, as well. Consider a layer with a sigmoid activa- network parameters during training. To improve the train-
tion function z = g(W u + b) where u is the layer input, ing, we seek to reduce the internal covariate shift. By
the weight matrix W and bias vector b are the layer pa- fixing the distribution of the layer inputs x as the training
1
rameters to be learned, and g(x) = 1+exp(−x) . As |x| progresses, we expect to improve the training speed. It has
′
increases, g (x) tends to zero. This means that for all di- been long known (LeCun et al., 1998b; Wiesler & Ney,
mensions of x = W u+b except those with small absolute 2011) that the network training converges faster if its in-
values, the gradient flowing down to u will vanish and the puts are whitened – i.e., linearly transformed to have zero
model will train slowly. However, since x is affected by means and unit variances, and decorrelated. As each layer
W, b and the parameters of all the layers below, changes observes the inputs produced by the layers below, it would
to those parameters during training will likely move many be advantageous to achieve the same whitening of the in-
dimensions of x into the saturated regime of the nonlin- puts of each layer. By whitening the inputs to each layer,
earity and slow down the convergence. This effect is we would take a step towards achieving the fixed distri-
amplified as the network depth increases. In practice, butions of inputs that would remove the ill effects of the
the saturation problem and the resulting vanishing gradi- internal covariate shift.
ents are usually addressed by using Rectified Linear Units
We could consider whitening activations at every train-
(Nair & Hinton, 2010) ReLU (x) = max(x, 0), careful
ing step or at some interval, either by modifying the
initialization (Bengio & Glorot, 2010; Saxe et al., 2013),
network directly or by changing the parameters of the
and small learning rates. If, however, we could ensure
optimization algorithm to depend on the network ac-
that the distribution of nonlinearity inputs remains more
tivation values (Wiesler et al., 2014; Raiko et al., 2012;
stable as the network trains, then the optimizer would be
Povey et al., 2014; Desjardins & Kavukcuoglu). How-
less likely to get stuck in the saturated regime, and the
ever, if these modifications are interspersed with the op-
training would accelerate.
timization steps, then the gradient descent step may at-
tempt to update the parameters in a way that requires
We refer to the change in the distributions of internal the normalization to be updated, which reduces the ef-
nodes of a deep network, in the course of training, as In- fect of the gradient step. For example, consider a layer
ternal Covariate Shift. Eliminating it offers a promise of with the input u that adds the learned bias b, and normal-
faster training. We propose a new mechanism, which we izes the result by subtracting the mean of the activation
call Batch Normalization, that takes a step towards re- computed over the training data: x b = x − E[x] where
ducing internal covariate shift, and in doing so dramati- x = u + b, X = {x1...N } is the set of values of x over
PN
cally accelerates the training of deep neural nets. It ac- the training set, and E[x] = N1 i=1 xi . If a gradient
complishes this via a normalization step that fixes the descent step ignores the dependence of E[x] on b, then it
means and variances of layer inputs. Batch Normalization will update b ← b + ∆b, where ∆b ∝ −∂ℓ/∂b x. Then
also has a beneficial effect on the gradient flow through u + (b + ∆b) − E[u + (b + ∆b)] = u + b − E[u + b].
the network, by reducing the dependence of gradients Thus, the combination of the update to b and subsequent
on the scale of the parameters or of their initial values. change in normalization led to no change in the output
This allows us to use much higher learning rates with- of the layer nor, consequently, the loss. As the training
out the risk of divergence. Furthermore, batch normal- continues, b will grow indefinitely while the loss remains
ization regularizes the model and reduces the need for fixed. This problem can get worse if the normalization not
Dropout (Srivastava et al., 2014). Finally, Batch Normal- only centers but also scales the activations. We have ob-
ization makes it possible to use saturating nonlinearities served this empirically in initial experiments, where the
by preventing the network from getting stuck in the satu- model blows up when the normalization parameters are
rated modes. computed outside the gradient descent step.
The issue with the above approach is that the gradient
In Sec. 4.2, we apply Batch Normalization to the best- descent optimization does not take into account the fact
performing ImageNet classification network, and show that the normalization takes place. To address this issue,
that we can match its performance using only 7% of the we would like to ensure that, for any parameter values,
training steps, and can further exceed its accuracy by a the network always produces activations with the desired
substantial margin. Using an ensemble of such networks distribution. Doing so would allow the gradient of the
trained with Batch Normalization, we achieve the top-5 loss with respect to the model parameters to account for
error rate that improves upon the best known results on the normalization, and for its dependence on the model
ImageNet classification. parameters Θ. Let again x be a layer input, treated as a

2
vector, and X be the set of these inputs over the training Note that simply normalizing each input of a layer may
data set. The normalization can then be written as a trans- change what the layer can represent. For instance, nor-
formation malizing the inputs of a sigmoid would constrain them to
x = Norm(x, X )
b the linear regime of the nonlinearity. To address this, we
make sure that the transformation inserted in the network
which depends not only on the given training example x can represent the identity transform. To accomplish this,
but on all examples X – each of which depends on Θ if we introduce, for each activation x(k) , a pair of parameters
x is generated by another layer. For backpropagation, we γ (k) , β (k) , which scale and shift the normalized value:
would need to compute the Jacobians
y (k) = γ (k) x
b(k) + β (k) .
∂Norm(x, X ) ∂Norm(x, X )
and ; These parameters are learned along with the original
∂x ∂X
model parameters, and restore the representation
p power
ignoring the latter term would lead to the explosion de- of the network. Indeed, by setting γ (k) = Var[x(k) ] and
scribed above. β (k) = E[x(k) ], we could recover the original activations,
Within this framework, whitening the layer inputs is if that were the optimal thing to do.
complicated. One reason is that it would require a com- In the batch setting where each training step is based on
putation of covariance matrices over the training data; this the entire training set, we would use the whole set to nor-
would be hard to accomplish in a stochastic gradient de- malize activations. However, this is impractical when us-
scent setup. A more fundamental problem is that the nor- ing stochastic optimization. Therefore, we make the sec-
malization Norm(x, X ) would include the computation of ond simplification: since we use mini-batches in stochas-
the Singular Value Decomposition of X , which is not a tic gradient training, each mini-batch produces estimates
continuous function of X (O’Neil, 2005). So, ∂Norm(x,X
∂X
)
of the mean and variance of each activation. This way,
cannot be computed everywhere – and is expensive to the statistics used for normalization can fully participate
compute where it can be. This motivates us to seek an al- in the gradient backpropagation.
ternative that performs input normalization in a way that Consider a mini-batch B of size m. Since the normal-
is differentiable and does not require the analysis of the ization is applied to each activation independently, let us
entire training set after every parameter update. focus on a particular activation x(k) and omit k for clarity.
Some of the previous approaches (e.g. We have m values of this activation in the mini-batch,
(Lyu & Simoncelli, 2008)) use statistics computed
B = {x1...m }.
over a single training example, or, in the case of image
networks, over different feature maps at a given location. Let the normalized values be x b1...m , and their linear trans-
However, this changes the representation ability of a formations be y1...m . We refer to the transform
network by discarding the absolute scale of activations.
We want to a preserve the information in the network, by BNγ,β : x1...m → y1...m
normalizing the activations in a training example relative as the Batch Normalizing Transform. We present the BN
to the statistics of the entire training data. Transform in Algorithm 1. In the algorithm, ǫ is a constant
added to the mini-batch variance for numerical stability.
3 Normalization via Mini-Batch Input: Values of x over a mini-batch: B = {x1...m };
Statistics Parameters to be learned: γ, β
Output: {yi = BNγ,β (xi )}
Since the full whitening of each layer’s inputs is costly
m
and not everywhere differentiable, we make two neces- 1 X
sary simplifications. The first is that instead of whitening µB ← xi // mini-batch mean
m i=1
the features in layer inputs and outputs jointly, we will m
normalize each scalar feature independently, by making it 1 X
σB2 ← (xi − µB )2 // mini-batch variance
have the mean of zero and the variance of 1. For a layer m i=1
with d-dimensional input x = (x(1) . . . x(d) ), we will nor- xi − µB
malize each dimension bi ← p 2
x // normalize
σB + ǫ
x(k) − E[x(k) ] yi ← γb
xi + β ≡ BNγ,β (xi ) // scale and shift
b(k) = p
x
Var[x(k) ]
Algorithm 1: Batch Normalizing Transform, applied to
where the expectation and variance are computed over the activation x over a mini-batch.
training data set. As shown in (LeCun et al., 1998b), such
normalization speeds up convergence, even when the fea- The BN transform can be added to a network to manip-
tures are not decorrelated. ulate any activation. In the notation y = BNγ,β (x), we

3
indicate that the parameters γ and β are to be learned, (Duchi et al., 2011). The normalization of activations that
but it should be noted that the BN transform does not depends on the mini-batch allows efficient training, but is
independently process the activation in each training ex- neither necessary nor desirable during inference; we want
ample. Rather, BNγ,β (x) depends both on the training the output to depend only on the input, deterministically.
example and the other examples in the mini-batch. The For this, once the network has been trained, we use the
scaled and shifted values y are passed to other network normalization
layers. The normalized activations x b are internal to our x − E[x]
b= p
x
transformation, but their presence is crucial. The distri- Var[x] + ǫ
butions of values of any x b has the expected value of 0
using the population, rather than mini-batch, statistics.
and the variance of 1, as long as the elements of each
Neglecting ǫ, these normalized activations have the same
mini-batch are sampled from the same distribution, and
mean 0 and variance 1 as during training. We use the un-
if we neglect ǫ. ThisPcan be seen by observing that
P m
m 1 m biased variance estimate Var[x] = m−1 · EB [σB2 ], where
i=1 x
bi = 0 and m b2i = 1, and taking expec-
i=1 x the expectation is over training mini-batches of size m and
tations. Each normalized activation x b(k) can be viewed as
σB2 are their sample variances. Using moving averages in-
an input to a sub-network composed of the linear trans-
stead, we can track the accuracy of a model as it trains.
form y (k) = γ (k) xb(k) + β (k) , followed by the other pro-
Since the means and variances are fixed during inference,
cessing done by the original network. These sub-network
the normalization is simply a linear transform applied to
inputs all have fixed means and variances, and although
each activation. It may further be composed with the scal-
the joint distribution of these normalized x b(k) can change
ing by γ and shift by β, to yield a single linear transform
over the course of training, we expect that the introduc-
that replaces BN(x). Algorithm 2 summarizes the proce-
tion of normalized inputs accelerates the training of the
dure for training batch-normalized networks.
sub-network and, consequently, the network as a whole.
During training we need to backpropagate the gradi-
ent of loss ℓ through this transformation, as well as com- Input: Network N with trainable parameters Θ;
pute the gradients with respect to the parameters of the subset of activations {x(k) }K
k=1
BN transform. We use chain rule, as follows (before sim- Output: Batch-normalized network for inference, Ninf BN
tr
plification): 1: NBN ← N // Training BN network
2: for k = 1 . . . K do
∂ℓ ∂ℓ
∂bxi = ∂yi · γ 3: Add transformation y (k) = BNγ (k) ,β (k) (x(k) ) to
Pm ∂ℓ Ntr
∂ℓ
= · (xi − µB ) · −1 2 −3/2 BN (Alg. 1)
∂σB2 i=1 ∂bxi 2 (σB + ǫ)
4: Modify each layer in Ntr BN with input x
(k)
to take
Pm Pm (k)
∂ℓ
= ∂ℓ
· √ −1
+ ∂ℓ
· i=1 −2(xi −µB ) y instead
∂µB i=1 ∂b
xi 2 ∂σ 2 m
σB +ǫ B 5: end for
tr
∂ℓ
= ∂ℓ
·√ 1
+ ∂ℓ
· 2(xi −µB )
+ ∂ℓ
· 1 6: Train NBN to optimize the parameters Θ ∪
∂xi ∂b
xi 2 +ǫ ∂σB2 m ∂µB m
σB (k) (k) K
{γ , β }k=1
∂ℓ
Pm ∂ℓ inf tr
∂γ = i=1 ∂yi ·x
bi 7: NBN ← NBN // Inference BN network with frozen
∂ℓ Pm ∂ℓ // parameters
∂β = i=1 ∂yi
8: for k = 1 . . . K do
Thus, BN transform is a differentiable transformation that 9:
(k)
// For clarity, x ≡ x(k) , γ ≡ γ (k) , µB ≡ µB , etc.
introduces normalized activations into the network. This 10: Process multiple training mini-batches B, each of
ensures that as the model is training, layers can continue size m, and average over them:
learning on input distributions that exhibit less internal co-
E[x] ← EB [µB ]
variate shift, thus accelerating the training. Furthermore,
m 2
the learned affine transform applied to these normalized Var[x] ← m−1 EB [σB ]
activations allows the BN transform to represent the iden-
tity transformation and preserves the network capacity. 11: In Ninf
BN , replace the transform y = BN γ,β (x) with
γ γ E[x]
y= √ ·x+ β− √
Var[x]+ǫ Var[x]+ǫ
3.1 Training and Inference with Batch- 12: end for
Normalized Networks Algorithm 2: Training a Batch-Normalized Network
To Batch-Normalize a network, we specify a subset of ac-
tivations and insert the BN transform for each of them,
according to Alg. 1. Any layer that previously received 3.2 Batch-Normalized Convolutional Net-
x as the input, now receives BN(x). A model employing works
Batch Normalization can be trained using batch gradient
descent, or Stochastic Gradient Descent with a mini-batch Batch Normalization can be applied to any set of acti-
size m > 1, or with any of its variants such as Adagrad vations in the network. Here, we focus on transforms

4
that consist of an affine transformation followed by an the gradient during backpropagation and lead to the model
element-wise nonlinearity: explosion. However, with Batch Normalization, back-
propagation through a layer is unaffected by the scale of
z = g(W u + b) its parameters. Indeed, for a scalar a,
where W and b are learned parameters of the model, and BN(W u) = BN((aW )u)
g(·) is the nonlinearity such as sigmoid or ReLU. This for-
mulation covers both fully-connected and convolutional and we can show that
layers. We add the BN transform immediately before the
∂BN((aW )u) ∂BN(W u)
nonlinearity, by normalizing x = W u + b. We could have ∂u = ∂u
also normalized the layer inputs u, but since u is likely ∂BN((aW )u)
= 1 ∂BN(W u)
∂(aW ) a · ∂W
the output of another nonlinearity, the shape of its distri-
bution is likely to change during training, and constraining The scale does not affect the layer Jacobian nor, con-
its first and second moments would not eliminate the co- sequently, the gradient propagation. Moreover, larger
variate shift. In contrast, W u + b is more likely to have weights lead to smaller gradients, and Batch Normaliza-
a symmetric, non-sparse distribution, that is “more Gaus- tion will stabilize the parameter growth.
sian” (Hyvärinen & Oja, 2000); normalizing it is likely to We further conjecture that Batch Normalization may
produce activations with a stable distribution. lead the layer Jacobians to have singular values close to 1,
Note that, since we normalize W u+b, the bias b can be which is known to be beneficial for training (Saxe et al.,
ignored since its effect will be canceled by the subsequent 2013). Consider two consecutive layers with normalized
mean subtraction (the role of the bias is subsumed by β in inputs, and the transformation between these normalized
Alg. 1). Thus, z = g(W u + b) is replaced with vectors: bz = F (bx). If we assume that b
x and bz are Gaussian
and uncorrelated, and that F (b x) ≈ Jbx is a linear transfor-
z = g(BN(W u)) mation for the given model parameters, then both b x and bz
where the BN transform is applied independently to each have unit covariances, and I = Cov[bz] = JCov[b x]J T =
dimension of x = W u, with a separate pair of learned JJ T . Thus, JJ T = I, and so all singular values of J
parameters γ (k) , β (k) per dimension. are equal to 1, which preserves the gradient magnitudes
For convolutional layers, we additionally want the nor- during backpropagation. In reality, the transformation is
malization to obey the convolutional property – so that not linear, and the normalized values are not guaranteed to
different elements of the same feature map, at different be Gaussian nor independent, but we nevertheless expect
locations, are normalized in the same way. To achieve Batch Normalization to help make gradient propagation
this, we jointly normalize all the activations in a mini- better behaved. The precise effect of Batch Normaliza-
batch, over all locations. In Alg. 1, we let B be the set of tion on gradient propagation remains an area of further
all values in a feature map across both the elements of a study.
mini-batch and spatial locations – so for a mini-batch of
size m and feature maps of size p × q, we use the effec- 3.4 Batch Normalization regularizes the
tive mini-batch of size m′ = |B| = m · p q. We learn a model
pair of parameters γ (k) and β (k) per feature map, rather
than per activation. Alg. 2 is modified similarly, so that When training with Batch Normalization, a training ex-
during inference the BN transform applies the same linear ample is seen in conjunction with other examples in the
transformation to each activation in a given feature map. mini-batch, and the training network no longer produc-
ing deterministic values for a given training example. In
our experiments, we found this effect to be advantageous
3.3 Batch Normalization enables higher to the generalization of the network. Whereas Dropout
learning rates (Srivastava et al., 2014) is typically used to reduce over-
In traditional deep networks, too-high learning rate may fitting, in a batch-normalized network we found that it can
result in the gradients that explode or vanish, as well as be either removed or reduced in strength.
getting stuck in poor local minima. Batch Normaliza-
tion helps address these issues. By normalizing activa- 4 Experiments
tions throughout the network, it prevents small changes
to the parameters from amplifying into larger and subop-
4.1 Activations over time
timal changes in activations in gradients; for instance, it
prevents the training from getting stuck in the saturated To verify the effects of internal covariate shift on train-
regimes of nonlinearities. ing, and the ability of Batch Normalization to combat it,
Batch Normalization also makes training more resilient we considered the problem of predicting the digit class on
to the parameter scale. Normally, large learning rates may the MNIST dataset (LeCun et al., 1998a). We used a very
increase the scale of layer parameters, which then amplify simple network, with a 28x28 binary image as input, and

5
1
2 2 details are given in the Appendix. We refer to this model
0.9
as Inception in the rest of the text. The model was trained
0 0
0.8
Without BN using a version of Stochastic Gradient Descent with mo-
With BN
0.7
10K 20K 30K 40K 50K−2 −2 mentum (Sutskever et al., 2013), using the mini-batch size
(a) (b) Without BN (c) With BN of 32. The training was performed using a large-scale, dis-
tributed architecture (similar to (Dean et al., 2012)). All
Figure 1: (a) The test accuracy of the MNIST network networks are evaluated as training progresses by comput-
trained with and without Batch Normalization, vs. the ing the validation accuracy @1, i.e. the probability of
number of training steps. Batch Normalization helps the predicting the correct label out of 1000 possibilities, on
network train faster and achieve higher accuracy. (b, a held-out set, using a single crop per image.
c) The evolution of input distributions to a typical sig- In our experiments, we evaluated several modifications
moid, over the course of training, shown as {15, 50, 85}th of Inception with Batch Normalization. In all cases, Batch
percentiles. Batch Normalization makes the distribution Normalization was applied to the input of each nonlinear-
more stable and reduces the internal covariate shift. ity, in a convolutional way, as described in section 3.2,
while keeping the rest of the architecture constant.
3 fully-connected hidden layers with 100 activations each.
Each hidden layer computes y = g(W u+b) with sigmoid
4.2.1 Accelerating BN Networks
nonlinearity, and the weights W initialized to small ran-
dom Gaussian values. The last hidden layer is followed Simply adding Batch Normalization to a network does not
by a fully-connected layer with 10 activations (one per take full advantage of our method. To do so, we further
class) and cross-entropy loss. We trained the network for changed the network and its training parameters, as fol-
50000 steps, with 60 examples per mini-batch. We added lows:
Batch Normalization to each hidden layer of the network,
Increase learning rate. In a batch-normalized model,
as in Sec. 3.1. We were interested in the comparison be-
we have been able to achieve a training speedup from
tween the baseline and batch-normalized networks, rather
higher learning rates, with no ill side effects (Sec. 3.3).
than achieving the state of the art performance on MNIST
(which the described architecture does not). Remove Dropout. As described in Sec. 3.4, Batch Nor-
Figure 1(a) shows the fraction of correct predictions malization fulfills some of the same goals as Dropout. Re-
by the two networks on held-out test data, as training moving Dropout from Modified BN-Inception speeds up
progresses. The batch-normalized network enjoys the training, without increasing overfitting.
higher test accuracy. To investigate why, we studied in- Reduce the L2 weight regularization. While in Incep-
puts to the sigmoid, in the original network N and batch- tion an L2 loss on the model parameters controls overfit-
normalized network Ntr ting, in Modified BN-Inception the weight of this loss is
BN (Alg. 2) over the course of train-
reduced by a factor of 5. We find that this improves the
ing. In Fig. 1(b,c) we show, for one typical activation from
the last hidden layer of each network, how its distribu- accuracy on the held-out validation data.
tion evolves. The distributions in the original network Accelerate the learning rate decay. In training Incep-
change significantly over time, both in their mean and tion, learning rate was decayed exponentially. Because
the variance, which complicates the training of the sub- our network trains faster than Inception, we lower the
learning rate 6 times faster.
sequent layers. In contrast, the distributions in the batch-
normalized network are much more stable as training pro- Remove Local Response Normalization While Incep-
gresses, which aids the training. tion and other networks (Srivastava et al., 2014) benefit
from it, we found that with Batch Normalization it is not
necessary.
4.2 ImageNet classification
Shuffle training examples more thoroughly. We enabled
We applied Batch Normalization to a new variant of the within-shard shuffling of the training data, which prevents
Inception network (Szegedy et al., 2014), trained on the the same examples from always appearing in a mini-batch
ImageNet classification task (Russakovsky et al., 2014). together. This led to about 1% improvements in the val-
The network has a large number of convolutional and idation accuracy, which is consistent with the view of
pooling layers, with a softmax layer to predict the image Batch Normalization as a regularizer (Sec. 3.4): the ran-
class, out of 1000 possibilities. Convolutional layers use domization inherent in our method should be most bene-
ReLU as the nonlinearity. The main difference to the net- ficial when it affects an example differently each time it is
work described in (Szegedy et al., 2014) is that the 5 × 5 seen.
convolutional layers are replaced by two consecutive lay- Reduce the photometric distortions. Because batch-
ers of 3 × 3 convolutions with up to 128 filters. The net- normalized networks train faster and observe each train-
work contains 13.6 · 106 parameters, and, other than the ing example fewer times, we let the trainer focus on more
top softmax layer, has no fully-connected layers. More “real” images by distorting them less.

6
0.8

0.7
Model Steps to 72.2% Max accuracy
0.6
Inception 31.0 · 106 72.2%
BN-Baseline 13.3 · 106 72.7%
Inception
BN−Baseline BN-x5 2.1 · 106 73.0%
0.5 BN−x5
BN−x30
BN-x30 2.7 · 106 74.8%
BN−x5−Sigmoid BN-x5-Sigmoid 69.8%
Steps to match Inception
0.4
5M 10M 15M 20M 25M 30M Figure 3: For Inception and the batch-normalized
variants, the number of training steps required to
Figure 2: Single crop validation accuracy of Inception reach the maximum accuracy of Inception (72.2%),
and its batch-normalized variants, vs. the number of and the maximum accuracy achieved by the net-
training steps. work.

4.2.2 Single-Network Classification to be trained when sigmoid is used as the nonlinearity,

despite the well-known difficulty of training such net-
We evaluated the following networks, all trained on the works. Indeed, BN-x5-Sigmoid achieves the accuracy of
LSVRC2012 training data, and tested on the validation 69.8%. Without Batch Normalization, Inception with sig-
data: moid never achieves better than 1/1000 accuracy.
Inception: the network described at the beginning of
Section 4.2, trained with the initial learning rate of 0.0015.
BN-Baseline: Same as Inception with Batch Normal- 4.2.3 Ensemble Classification
ization before each nonlinearity. The current reported best results on the ImageNet Large
BN-x5: Inception with Batch Normalization and the Scale Visual Recognition Competition are reached by the
modifications in Sec. 4.2.1. The initial learning rate was Deep Image ensemble of traditional models (Wu et al.,
increased by a factor of 5, to 0.0075. The same learning 2015) and the ensemble model of (He et al., 2015). The
rate increase with original Inception caused the model pa- latter reports the top-5 error of 4.94%, as evaluated by the
rameters to reach machine infinity. ILSVRC server. Here we report a top-5 validation error of
BN-x30: Like BN-x5, but with the initial learning rate 4.9%, and test error of 4.82% (according to the ILSVRC
0.045 (30 times that of Inception). server). This improves upon the previous best result, and
BN-x5-Sigmoid: Like BN-x5, but with sigmoid non- exceeds the estimated accuracy of human raters according
1
linearity g(t) = 1+exp(−x) instead of ReLU. We also at- to (Russakovsky et al., 2014).
tempted to train the original Inception with sigmoid, but For our ensemble, we used 6 networks. Each was based
the model remained at the accuracy equivalent to chance. on BN-x30, modified via some of the following: increased
In Figure 2, we show the validation accuracy of the initial weights in the convolutional layers; using Dropout
networks, as a function of the number of training steps. (with the Dropout probability of 5% or 10%, vs. 40%
Inception reached the accuracy of 72.2% after 31 · 106 for the original Inception); and using non-convolutional,
training steps. The Figure 3 shows, for each network, per-activation Batch Normalization with last hidden lay-
the number of training steps required to reach the same ers of the model. Each network achieved its maximum
72.2% accuracy, as well as the maximum validation accu- accuracy after about 6 · 106 training steps. The ensemble
racy reached by the network and the number of steps to prediction was based on the arithmetic average of class
reach it. probabilities predicted by the constituent networks. The
By only using Batch Normalization (BN-Baseline), we details of ensemble and multicrop inference are similar to
match the accuracy of Inception in less than half the num- (Szegedy et al., 2014).
ber of training steps. By applying the modifications in We demonstrate in Fig. 4 that batch normalization al-
Sec. 4.2.1, we significantly increase the training speed of lows us to set new state-of-the-art by a healthy margin on
the network. BN-x5 needs 14 times fewer steps than In- the ImageNet classification challenge benchmarks.
ception to reach the 72.2% accuracy. Interestingly, in-
creasing the learning rate further (BN-x30) causes the
model to train somewhat slower initially, but allows it to 5 Conclusion
reach a higher final accuracy. It reaches 74.8% after 6·106
steps, i.e. 5 times fewer steps than required by Inception We have presented a novel mechanism for dramatically
to reach 72.2%. accelerating the training of deep networks. It is based on
We also verified that the reduction in internal covari- the premise that covariate shift, which is known to com-
ate shift allows deep networks with Batch Normalization plicate the training of machine learning systems, also ap-

7
Model Resolution Crops Models Top-1 error Top-5 error
GoogLeNet ensemble 224 144 7 - 6.67%
Deep Image low-res 256 - 1 - 7.96%
Deep Image high-res 512 - 1 24.88 7.42%
Deep Image ensemble variable - - - 5.98%
BN-Inception single crop 224 1 1 25.2% 7.82%
BN-Inception multicrop 224 144 1 21.99% 5.82%
BN-Inception ensemble 224 144 6 20.1% 4.9%*

Figure 4: Batch-Normalized Inception comparison with previous state of the art on the provided validation set com-
prising 50000 images. *BN-Inception ensemble has reached 4.82% top-5 error on the 100000 images of the test set of
the ImageNet as reported by the test server.

plies to sub-networks and layers, and removing it from entiating characteristics of Batch Normalization include
internal activations of the network may aid in training. the learned scale and shift that allow the BN transform
Our proposed method draws its power from normalizing to represent identity (the standardization layer did not re-
activations, and from incorporating this normalization in quire this since it was followed by the learned linear trans-
the network architecture itself. This ensures that the nor- form that, conceptually, absorbs the necessary scale and
malization is appropriately handled by any optimization shift), handling of convolutional layers, deterministic in-
method that is being used to train the network. To en- ference that does not depend on the mini-batch, and batch-
able stochastic optimization methods commonly used in normalizing each convolutional layer in the network.
deep network training, we perform the normalization for In this work, we have not explored the full range of
each mini-batch, and backpropagate the gradients through possibilities that Batch Normalization potentially enables.
the normalization parameters. Batch Normalization adds Our future work includes applications of our method to
only two extra parameters per activation, and in doing so Recurrent Neural Networks (Pascanu et al., 2013), where
preserves the representation ability of the network. We the internal covariate shift and the vanishing or exploding
presented an algorithm for constructing, training, and per- gradients may be especially severe, and which would al-
forming inference with batch-normalized networks. The low us to more thoroughly test the hypothesis that normal-
resulting networks can be trained with saturating nonlin- ization improves gradient propagation (Sec. 3.3). We plan
earities, are more tolerant to increased training rates, and to investigate whether Batch Normalization can help with
often do not require Dropout for regularization. domain adaptation, in its traditional sense – i.e. whether
Merely adding Batch Normalization to a state-of-the- the normalization performed by the network would al-
art image classification model yields a substantial speedup low it to more easily generalize to new data distribu-
in training. By further increasing the learning rates, re- tions, perhaps with just a recomputation of the population
moving Dropout, and applying other modifications af- means and variances (Alg. 2). Finally, we believe that fur-
forded by Batch Normalization, we reach the previous ther theoretical analysis of the algorithm would allow still
state of the art with only a small fraction of training steps more improvements and applications.
– and then beat the state of the art in single-network image
classification. Furthermore, by combining multiple mod-
els trained with Batch Normalization, we perform better References
than the best known system on ImageNet, by a significant
margin. Bengio, Yoshua and Glorot, Xavier. Understanding the
difficulty of training deep feedforward neural networks.
Interestingly, our method bears similarity to the stan- In Proceedings of AISTATS 2010, volume 9, pp. 249–
dardization layer of (Gülçehre & Bengio, 2013), though 256, May 2010.
the two methods stem from very different goals, and per-
form different tasks. The goal of Batch Normalization Dean, Jeffrey, Corrado, Greg S., Monga, Rajat, Chen, Kai,
is to achieve a stable distribution of activation values Devin, Matthieu, Le, Quoc V., Mao, Mark Z., Ranzato,
throughout training, and in our experiments we apply it Marc’Aurelio, Senior, Andrew, Tucker, Paul, Yang, Ke,
before the nonlinearity since that is where matching the and Ng, Andrew Y. Large scale distributed deep net-
first and second moments is more likely to result in a works. In NIPS, 2012.
stable distribution. On the contrary, (Gülçehre & Bengio,
2013) apply the standardization layer to the output of the Desjardins, Guillaume and Kavukcuoglu, Koray. Natural
nonlinearity, which results in sparser activations. In our neural networks. (unpublished).
large-scale image classification experiments, we have not
observed the nonlinearity inputs to be sparse, neither with Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive
nor without Batch Normalization. Other notable differ- subgradient methods for online learning and stochastic

8
optimization. J. Mach. Learn. Res., 12:2121–2159, July Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan,
2011. ISSN 1532-4435. Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpa-
thy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg,
Gülçehre, Çaglar and Bengio, Yoshua. Knowledge mat- Alexander C., and Fei-Fei, Li. ImageNet Large Scale
ters: Importance of prior information for optimization. Visual Recognition Challenge, 2014.
CoRR, abs/1301.4083, 2013.
Saxe, Andrew M., McClelland, James L., and Ganguli,
He, K., Zhang, X., Ren, S., and Sun, J. Delving Deep Surya. Exact solutions to the nonlinear dynamics
into Rectifiers: Surpassing Human-Level Performance of learning in deep linear neural networks. CoRR,
on ImageNet Classification. ArXiv e-prints, February abs/1312.6120, 2013.
2015. Shimodaira, Hidetoshi. Improving predictive inference
under covariate shift by weighting the log-likelihood
Hyvärinen, A. and Oja, E. Independent component anal-
function. Journal of Statistical Planning and Inference,
ysis: Algorithms and applications. Neural Netw., 13
90(2):227–244, October 2000.
(4-5):411–430, May 2000.
Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex,
Jiang, Jing. A literature survey on do- Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout:
main adaptation of statistical classifiers. A simple way to prevent neural networks from overfit-
http://sifaka.cs.uiuc.edu/jiang4/domain_adaptation/survey/,
ting. J. Mach. Learn. Res., 15(1):1929–1958, January
2008. Accessed: 2014-01-24. 2014.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Sutskever, Ilya, Martens, James, Dahl, George E., and
Gradient-based learning applied to document recog- Hinton, Geoffrey E. On the importance of initial-
nition. Proceedings of the IEEE, 86(11):2278–2324, ization and momentum in deep learning. In ICML
November 1998a. (3), volume 28 of JMLR Proceedings, pp. 1139–1147.
JMLR.org, 2013.
LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient
backprop. In Orr, G. and K., Muller (eds.), Neural Net- Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet,
works: Tricks of the trade. Springer, 1998b. Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Du-
mitru, Vanhoucke, Vincent, and Rabinovich, An-
Lyu, S and Simoncelli, E P. Nonlinear image representa- drew. Going deeper with convolutions. CoRR,
tion using divisive normalization. In Proc. Computer abs/1409.4842, 2014.
Vision and Pattern Recognition, pp. 1–8. IEEE Com- Wiesler, Simon and Ney, Hermann. A convergence anal-
puter Society, Jun 23-28 2008. doi: 10.1109/CVPR. ysis of log-linear training. In Shawe-Taylor, J., Zemel,
2008.4587821. R.S., Bartlett, P., Pereira, F.C.N., and Weinberger, K.Q.
(eds.), Advances in Neural Information Processing Sys-
Nair, Vinod and Hinton, Geoffrey E. Rectified linear units
tems 24, pp. 657–665, Granada, Spain, December 2011.
improve restricted boltzmann machines. In ICML, pp.
807–814. Omnipress, 2010. Wiesler, Simon, Richard, Alexander, Schlüter, Ralf, and
Ney, Hermann. Mean-normalized stochastic gradient
O’Neil, Kevin A. Critical points of the singular value de- for large-scale deep learning. In IEEE International
composition. SIAM J. Matrix Analysis Applications, 27 Conference on Acoustics, Speech, and Signal Process-
(2):459–473, 2005. ing, pp. 180–184, Florence, Italy, May 2014.

Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. Wu, Ren, Yan, Shengen, Shan, Yi, Dang, Qingqing, and
On the difficulty of training recurrent neural networks. Sun, Gang. Deep image: Scaling up image recognition,
In Proceedings of the 30th International Conference on 2015.
Machine Learning, ICML 2013, Atlanta, GA, USA, 16-
21 June 2013, pp. 1310–1318, 2013.
Appendix
Povey, Daniel, Zhang, Xiaohui, and Khudanpur, San-
jeev. Parallel training of deep neural networks with Variant of the Inception Model Used
natural gradient and parameter averaging. CoRR,
abs/1410.7455, 2014. Figure 5 documents the changes that were performed
compared to the architecture with respect to the
Raiko, Tapani, Valpola, Harri, and LeCun, Yann. Deep GoogleNet archictecture. For the interpretation of this
learning made easier by linear transformations in per- table, please consult (Szegedy et al., 2014). The notable
ceptrons. In International Conference on Artificial In- architecture changes compared to the GoogLeNet model
telligence and Statistics (AISTATS), pp. 924–932, 2012. include:

9
• The 5×5 convolutional layers are replaced by two
consecutive 3×3 convolutional layers. This in-
creases the maximum depth of the network by 9
weight layers. Also it increases the number of pa-
rameters by 25% and the computational cost is in-
creased by about 30%.
• The number 28×28 inception modules is increased
from 2 to 3.
• Inside the modules, sometimes average, sometimes
maximum-pooling is employed. This is indicated in
the entries corresponding to the pooling layers of the
table.
• There are no across the board pooling layers be-
tween any two Inception modules, but stride-2 con-
volution/pooling layers are employed before the fil-
ter concatenation in the modules 3c, 4e.
Our model employed separable convolution with depth
multiplier 8 on the first convolutional layer. This reduces
the computational cost while increasing the memory con-
sumption at training time.

10
patch size/ output #3×3 double #3×3 double
type depth #1×1 #3×3 Pool +proj
stride size reduce reduce #3×3
convolution* 7×7/2 112×112×64 1
max pool 3×3/2 56×56×64 0
convolution 3×3/1 56×56×192 1 64 192
max pool 3×3/2 28×28×192 0
inception (3a) 28×28×256 3 64 64 64 64 96 avg + 32
inception (3b) 28×28×320 3 64 64 96 64 96 avg + 64
inception (3c) stride 2 28×28×576 3 0 128 160 64 96 max + pass through
inception (4a) 14×14×576 3 224 64 96 96 128 avg + 128
inception (4b) 14×14×576 3 192 96 128 96 128 avg + 128
inception (4c) 14×14×576 3 160 128 160 128 160 avg + 128
inception (4d) 14×14×576 3 96 128 192 160 192 avg + 128
inception (4e) stride 2 14×14×1024 3 0 128 192 192 256 max + pass through
inception (5a) 7×7×1024 3 352 192 320 160 224 avg + 128
inception (5b) 7×7×1024 3 352 192 320 192 224 max + 128
avg pool 7×7/1 1×1×1024 0

Figure 5: Inception architecture

MineralTree User Guide 6.3.0
0% (1)
MineralTree User Guide 6.3.0
57 pages
Dynalite Installers Course Training Guide
No ratings yet
Dynalite Installers Course Training Guide
86 pages
aDSA SuperComp4Trng DNN
No ratings yet
aDSA SuperComp4Trng DNN
12 pages
DL
No ratings yet
DL
12 pages
Assignment_13_Modern_AI
No ratings yet
Assignment_13_Modern_AI
3 pages
Technical_writing (1)
No ratings yet
Technical_writing (1)
9 pages
Unit II
No ratings yet
Unit II
56 pages
Technical_writing (2)
No ratings yet
Technical_writing (2)
9 pages
Marginalized Denoising Auto-Encoders For Nonlinear Representations
No ratings yet
Marginalized Denoising Auto-Encoders For Nonlinear Representations
9 pages
Final CCIS Paper - 15 August 2019
No ratings yet
Final CCIS Paper - 15 August 2019
6 pages
NeurIPS21 Implicit
No ratings yet
NeurIPS21 Implicit
24 pages
Assignment - 4
No ratings yet
Assignment - 4
24 pages
Midterm_Report_Example3
No ratings yet
Midterm_Report_Example3
4 pages
An Experimental Survey of Simple K-Nearest Neighbour Condensing and Editing Algorithms
No ratings yet
An Experimental Survey of Simple K-Nearest Neighbour Condensing and Editing Algorithms
8 pages
Gradnorm: Gradient Normalization For Adaptive Loss Balancing in Deep Multitask Networks
No ratings yet
Gradnorm: Gradient Normalization For Adaptive Loss Balancing in Deep Multitask Networks
12 pages
Lecture 5
No ratings yet
Lecture 5
34 pages
WEEK 9
No ratings yet
WEEK 9
80 pages
Reinforcement Learning Optimization
No ratings yet
Reinforcement Learning Optimization
6 pages
2017 IDEC guo
No ratings yet
2017 IDEC guo
7 pages
noise_step
No ratings yet
noise_step
4 pages
Parameter-Efficient Transfer Learning For NLP
No ratings yet
Parameter-Efficient Transfer Learning For NLP
10 pages
path_sgd_behnam
No ratings yet
path_sgd_behnam
12 pages
Global Sparse Momentum SGD For Pruning Very Deep Neural Networks
No ratings yet
Global Sparse Momentum SGD For Pruning Very Deep Neural Networks
13 pages
Learning Where To Learn - Gradient Sparsity in Meta and Continual Learning
No ratings yet
Learning Where To Learn - Gradient Sparsity in Meta and Continual Learning
25 pages
Gradient Descent Algorithm and Its Variants - GeeksforGeeks
No ratings yet
Gradient Descent Algorithm and Its Variants - GeeksforGeeks
5 pages
Machine Learning Techniques Final Report (Fall, 2020) : ML - Explorer
No ratings yet
Machine Learning Techniques Final Report (Fall, 2020) : ML - Explorer
6 pages
DL mod 2
No ratings yet
DL mod 2
4 pages
Learning To Learn by Gradient Descent by Gradient Descent
No ratings yet
Learning To Learn by Gradient Descent by Gradient Descent
10 pages
Bio Optimization of Deep Learning Network Architectures 22fguqp5
No ratings yet
Bio Optimization of Deep Learning Network Architectures 22fguqp5
11 pages
2016-CVPR-Joint Unsupervised Learning of Deep Representations and Image Clusters
No ratings yet
2016-CVPR-Joint Unsupervised Learning of Deep Representations and Image Clusters
10 pages
Comparison of Gradient Descent Algorithms On Training Neural Networks
No ratings yet
Comparison of Gradient Descent Algorithms On Training Neural Networks
20 pages
Multi-Agent Deep Reinforcement Learning: Maxim Egorov Stanford University
No ratings yet
Multi-Agent Deep Reinforcement Learning: Maxim Egorov Stanford University
8 pages
1734451533458_2425-CS420-22TT-HW04
No ratings yet
1734451533458_2425-CS420-22TT-HW04
6 pages
MODULE 2 DL
No ratings yet
MODULE 2 DL
9 pages
A Bayesian Approach With Type-2 Student-T Membership Function For T-S Model Identification
No ratings yet
A Bayesian Approach With Type-2 Student-T Membership Function For T-S Model Identification
5 pages
Module 2 Deep Feed Forward Networks
No ratings yet
Module 2 Deep Feed Forward Networks
18 pages
Technical_writing
No ratings yet
Technical_writing
8 pages
Amos, Kolter - 2017 - OptNet Differentiable Optimization As A Layer in Neural Networks
No ratings yet
Amos, Kolter - 2017 - OptNet Differentiable Optimization As A Layer in Neural Networks
10 pages
NeurIPS 2021 Pay Attention To Mlps Paper
No ratings yet
NeurIPS 2021 Pay Attention To Mlps Paper
12 pages
Reconciling Lambda-Returns With Experience Replay
No ratings yet
Reconciling Lambda-Returns With Experience Replay
17 pages
Mask DM
No ratings yet
Mask DM
23 pages
Do Deep Nets Really Need to be Deep
No ratings yet
Do Deep Nets Really Need to be Deep
9 pages
Collaborative Learning For Faster StyleGAN Embedding
No ratings yet
Collaborative Learning For Faster StyleGAN Embedding
10 pages
An Efficient Distributed Stochastic Gradient Descent Algorithm for Deep-Learning Applications
No ratings yet
An Efficient Distributed Stochastic Gradient Descent Algorithm for Deep-Learning Applications
10 pages
Multi Perceptor
No ratings yet
Multi Perceptor
37 pages
2007.13904v2
No ratings yet
2007.13904v2
20 pages
2015TrainingArtificialNeuralNetworkUsingModificationofDifferentialEvolutionAlgorithm
No ratings yet
2015TrainingArtificialNeuralNetworkUsingModificationofDifferentialEvolutionAlgorithm
7 pages
Riemannian Low-Rank Model Compression for Federated Learning With Over-The-Air Aggregation
No ratings yet
Riemannian Low-Rank Model Compression for Federated Learning With Over-The-Air Aggregation
16 pages
A Heuristic Algorithm To Incremental Support Vector Machine Learning
No ratings yet
A Heuristic Algorithm To Incremental Support Vector Machine Learning
4 pages
2024_How Lightweight Can A Vision Transformer Be_Tan_arXiv
No ratings yet
2024_How Lightweight Can A Vision Transformer Be_Tan_arXiv
8 pages
AdamZ research paper
No ratings yet
AdamZ research paper
13 pages
Unit - 4 ANN
No ratings yet
Unit - 4 ANN
17 pages
4514-Article Text-7553-1-10-20190706
No ratings yet
4514-Article Text-7553-1-10-20190706
8 pages
Bumblebee: Secure Two-Party Inference Framework For Large Transformers
No ratings yet
Bumblebee: Secure Two-Party Inference Framework For Large Transformers
18 pages
Rennie 2014
No ratings yet
Rennie 2014
6 pages
Adafactor - Adaptive Learning Rates With Sublinear Memory Cost
No ratings yet
Adafactor - Adaptive Learning Rates With Sublinear Memory Cost
9 pages
How To Train Your Pre Trained GAN Models: Sung Wook Park Jun Yeong Kim Jun Park Se Hoon Jung Chun Bo Sim
No ratings yet
How To Train Your Pre Trained GAN Models: Sung Wook Park Jun Yeong Kim Jun Park Se Hoon Jung Chun Bo Sim
26 pages
Sarlin SuperGlue Learning Feature Matching With Graph Neural Networks CVPR 2020 Paper
No ratings yet
Sarlin SuperGlue Learning Feature Matching With Graph Neural Networks CVPR 2020 Paper
10 pages
SSF
No ratings yet
SSF
20 pages
Depth Dropout
No ratings yet
Depth Dropout
7 pages
MODULE 2 DL SNOTES P1
No ratings yet
MODULE 2 DL SNOTES P1
16 pages
Backpropagation: Fundamentals and Applications for Preparing Data for Training in Deep Learning
From Everand
Backpropagation: Fundamentals and Applications for Preparing Data for Training in Deep Learning
Fouad Sabry
No ratings yet
Next-Generation Service Assurance Improves Operational Efficiency
No ratings yet
Next-Generation Service Assurance Improves Operational Efficiency
2 pages
Update History For Office 2013
No ratings yet
Update History For Office 2013
4 pages
Easy-Pro Builders Estimating Spreadsheet Manual V8 - 0
100% (1)
Easy-Pro Builders Estimating Spreadsheet Manual V8 - 0
20 pages
Product Details of Total Station Survey Instrument Topcon GTS-239
No ratings yet
Product Details of Total Station Survey Instrument Topcon GTS-239
3 pages
Python Basics-1
No ratings yet
Python Basics-1
13 pages
Changing Network Setting Console STRM
No ratings yet
Changing Network Setting Console STRM
8 pages
Chapter 4 - Emergig-Technology-IOT
No ratings yet
Chapter 4 - Emergig-Technology-IOT
48 pages
SAR ADC Tutorial
75% (4)
SAR ADC Tutorial
13 pages
DH v1-2 GM Screen
No ratings yet
DH v1-2 GM Screen
4 pages
Countries List
No ratings yet
Countries List
4 pages
Certificate of Non Property
No ratings yet
Certificate of Non Property
1 page
Laptop Lecture
No ratings yet
Laptop Lecture
7 pages
AHE Introduction Deck 1.2.24
No ratings yet
AHE Introduction Deck 1.2.24
17 pages
Exploiting Honeypot For Cryptojacking The Other Side of The Story of Honeypot Deployment
No ratings yet
Exploiting Honeypot For Cryptojacking The Other Side of The Story of Honeypot Deployment
5 pages
First Course in Mathematical Modeling 5th ed. A instant download
No ratings yet
First Course in Mathematical Modeling 5th ed. A instant download
14 pages
Interfacing The Led Using 8086 Microprocessor
No ratings yet
Interfacing The Led Using 8086 Microprocessor
12 pages
1.5 - Elementary Matrices and Finding Inverse of A Matrix-1
No ratings yet
1.5 - Elementary Matrices and Finding Inverse of A Matrix-1
14 pages
CT7075 Information Security Management - K Poowanendiran Feedback
No ratings yet
CT7075 Information Security Management - K Poowanendiran Feedback
6 pages
Retrobelt Mustang Instructions 090908 Final
No ratings yet
Retrobelt Mustang Instructions 090908 Final
11 pages
Bca Lesson Plan-Ites
No ratings yet
Bca Lesson Plan-Ites
4 pages
ACS 1000 Medium Voltage Drives: 315 - 5000 KW 400 - 6200 HP
No ratings yet
ACS 1000 Medium Voltage Drives: 315 - 5000 KW 400 - 6200 HP
112 pages
Action Plan - Marketing 2023
No ratings yet
Action Plan - Marketing 2023
5 pages
Book Shop System Management
No ratings yet
Book Shop System Management
18 pages
Efficient C Coding For AVR
No ratings yet
Efficient C Coding For AVR
15 pages
Auditing in Computer Environment Presentation 1224128964994975 8
100% (1)
Auditing in Computer Environment Presentation 1224128964994975 8
90 pages
DRF No: Rta
No ratings yet
DRF No: Rta
3 pages
21BCE9257 - CN - Lab Assignment 4
No ratings yet
21BCE9257 - CN - Lab Assignment 4
8 pages
The Hybrid FEM/FDM Computer Model For Analysis of The Metering Section of A Single-Screw Extruder
No ratings yet
The Hybrid FEM/FDM Computer Model For Analysis of The Metering Section of A Single-Screw Extruder
10 pages