CS 182/282A Designing, Visualizing and Understanding Deep Neural Networks

Spring 2021 Sergey Levine Discussion 4

This discussion will cover CNN architectures, batch normalization, weight initializations, ensembles
and dropout.

1 Convolutional Neural Networks Architectures


We will survey some of the most famous convolutional neural network architectures.

LeNet. Among the earlier CNN architectures, LeNet is the most widely known. LeNet was used mostly
for handwritten digit recognition on the MNIST dataset. Importantly, LeNet used a series of convolutional
layers, then pooling layers, followed by several fully connected (FC) layers.

AlexNet. The AlexNet architecture popularized CNNs in computer vision, when it won the ImageNet
ILSVRC Challenge in 2012 by a large margin. AlexNet has a similar architectural design as LeNet, except
that it is bigger (more neurons) and deeper (more layers). In addition, AlexNet demonstrated the benefits of
using the ReLU activation and dropout for vision tasks, as well as the use of GPUs for accelerated training.

VGGNet. This network was the runner-up in ILSVRC 2014 to GoogLeNet, and showed the benefit of
(a) increasing the number of layers, and (b) using only convolutional operators stacked on each other. A
downside is that this network has roughly 138 million parameters, so in general, consider using Residual
Nets (see next item).

ResNet. These networks use skip connections to allow inputs and gradients to propagate more easily
through the network (both forward and backward). Residual networks were state of the art for image
recognition in mid-2016, and the general backbone remains in common use today. They have substantially
fewer parameters than VGG. The exact number depends on which "ResNet-X" variant is used, where "X"
is the number of layers; PyTorch offers pretrained models for X = 18, 34, 50, 101, and 152. For reference,
ResNet-152 has about 60 million parameters.
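
The effect of the skip connection can be seen in a minimal sketch of a fully connected residual block; the layer sizes and random weights below are purely illustrative:

```python
import numpy as np

def residual_block(x, W1, W2):
    """A minimal fully connected residual block: y = x + F(x).

    Because the identity path bypasses F, the Jacobian is I + dF/dx,
    so gradients can flow through even when dF/dx is small.
    """
    h = np.maximum(0, W1 @ x)  # F(x), first half: linear + ReLU
    f = W2 @ h                 # F(x), second half: linear
    return x + f               # skip connection adds the input back

# Usage: input and output dimensions must match for x + F(x) to be valid.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1 = rng.standard_normal((8, 4))
W2 = rng.standard_normal((4, 8))
y = residual_block(x, W1, W2)  # shape (4,)
```

Note that if F collapses to zero (e.g., all-zero weights), the block reduces exactly to the identity map.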
Problem: Vanishing Gradients in ResNet

How does skip connection in ResNet help solve the vanishing gradient problem?

CS 182/282A, Spring 2021, Discussion 4 1


2 Batch Normalization
The main idea behind Batch Normalization is to transform every sampled batch of data so that it has
µ = 0, σ² = 1. Using Batch Normalization typically makes networks significantly more robust to poor
initialization. It is based on the intuition that it is better for layers to receive unit-Gaussian inputs at
initialization. However, the reason why batch normalization works is not entirely understood, and there are
conflicting views on whether Batch Normalization reduces covariate shift, improves the smoothness of the
optimization landscape, or helps for some other reason.
In practice, when using batch normalization, we add a BatchNorm layer immediately after each FC or
convolutional layer, either before or after the non-linearity. The key observation is that normalization is a
relatively simple differentiable operation, so we do not add much additional complexity to the network.
Concretely, Batch Normalization proceeds by first computing the empirical mean and variance of a
mini-batch B of size m from the training set:
$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} a_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(a_i - \mu_B\right)^2$$

Then, for a layer of the network, each dimension a^(k) is normalized appropriately,

$$\bar{a}_i^{(k)} = \frac{a_i^{(k)} - \mu_B^{(k)}}{\sqrt{\left(\sigma_B^{(k)}\right)^2 + \epsilon}}$$

where ε is added for numerical stability.


In practice, after normalizing the input, we pass the result through a linear function with learnable scale
γ and bias β, so we have

$$\bar{a}_i^{(k)} = \gamma \, \frac{a_i^{(k)} - \mu_B^{(k)}}{\sqrt{\left(\sigma_B^{(k)}\right)^2 + \epsilon}} + \beta$$

Intuitively, γ and β allow us to restore the original activations if we would like, and during training they
can learn some other activation distribution that works better than a standard Gaussian.
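
The normalization and rescaling steps above can be sketched in a few lines of NumPy; the array shapes here are illustrative:

```python
import numpy as np

def batchnorm_forward(A, gamma, beta, eps=1e-5):
    """Batch-normalize a mini-batch A of shape (m, D).

    gamma, beta: learnable scale and shift, each of shape (D,).
    eps: small constant added for numerical stability.
    """
    mu = A.mean(axis=0)                    # empirical mean per dimension
    var = A.var(axis=0)                    # empirical variance per dimension
    A_hat = (A - mu) / np.sqrt(var + eps)  # normalize to mean 0, variance 1
    return gamma * A_hat + beta            # learnable rescale and shift

rng = np.random.default_rng(0)
A = rng.normal(loc=3.0, scale=2.0, size=(64, 5))
out = batchnorm_forward(A, gamma=np.ones(5), beta=np.zeros(5))
# With gamma = 1, beta = 0, each column of out has mean ~0 and variance ~1.
```
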
Problem: Examining the BatchNorm Layer

1. Draw out the computational graph of the BatchNorm layer

2. Given some dout, the derivative of the output of the BatchNorm layer, compute the derivatives
with respect to input x and parameters γ, β



3 Ensembles
Definition 1 (Ensemble). Ensembles (e.g., bagging or boosting) group several models trained on the same
task into a single model that aggregates their predictions.

Intuition The intuition for ensembles comes from the recognition that neural networks have many param-
eters, and thus often high variance. With multiple learners, we can average out the variance.

Ensemble Methods There are two ways we typically proceed with ensemble methods:

1. Prediction Averaging. Train N neural networks independently. Then, average their predictions
(either probabilistically or by majority vote)
2. Parameter Averaging. Parameter averaging does not work the same way as prediction averaging:
averaging the weights of independently trained networks generally fails. Instead, we only average
parameters in the context of snapshot ensembles, over one training trajectory rather than over
independent runs.

In practice, we do not need to reshuffle our dataset (or resample with replacement), since there is already
a lot of randomness in neural network training from weight initialization, minibatch shuffling, and SGD.
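
Prediction averaging can be sketched as follows; the probability arrays below are hypothetical softmax outputs from three independently trained models:

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average class probabilities from N models and take the argmax.

    prob_list: list of (num_samples, num_classes) softmax outputs.
    """
    avg = np.mean(prob_list, axis=0)  # average the predicted distributions
    return avg.argmax(axis=1)         # predict via the averaged probabilities

# Hypothetical outputs from three models on two samples, three classes:
p1 = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])
p2 = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p3 = np.array([[0.5, 0.4, 0.1], [0.4, 0.3, 0.3]])
preds = ensemble_predict([p1, p2, p3])  # → array([0, 1])
```
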

Making Ensemble Methods Faster Unfortunately, a downside to ensemble methods is that they can
be very slow.

1. Ensemble only the classification layers (e.g., the final FC layers).

2. Snapshot ensemble. Save out parameter snapshots over the course of SGD optimization and use each
snapshot as a model.
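
A snapshot ensemble loop can be sketched as below; the toy quadratic objective and the hyperparameters are purely illustrative:

```python
import numpy as np

def train_with_snapshots(params, grad_fn, lr, steps, snapshot_every):
    """Plain SGD that saves parameter snapshots along one trajectory.

    Each saved snapshot later serves as one member of the ensemble,
    avoiding the cost of training several networks from scratch.
    """
    snapshots = []
    for t in range(1, steps + 1):
        params = params - lr * grad_fn(params)
        if t % snapshot_every == 0:
            snapshots.append(params.copy())
    return snapshots

# Toy objective f(w) = ||w||^2 / 2, whose gradient is w itself:
snaps = train_with_snapshots(np.array([4.0]), grad_fn=lambda w: w,
                             lr=0.1, steps=30, snapshot_every=10)
# Three snapshots, taken at steps 10, 20, and 30.
```
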



4 Dropout
Definition 2 (Dropout). Dropout is a popular technique for regularizing neural networks by randomly re-
moving nodes with probability 1 − pkeep in the forward pass during training. The model is unchanged at
test time.

Intuition Dropout can be thought of as representing an ensemble of neural networks: because random
nodes are removed, each forward pass is effectively a different network.

Activation Scaling A caveat about dropout is that we must divide the activations by p during training,
since we do not change the model at test time; this "inverted dropout" scaling keeps the expected activation
magnitude the same as at test time, when no dimensions are forced to 0. Below is sample code to
demonstrate how Dropout works in practice for a 3-layer network.
import numpy as np

def dropout_train(X, p):
    """
    Forward pass for a 3-layer network with (inverted) dropout.
    NOTE: For simplicity, we do not include the backward pass or parameter update,
    and we assume weights W1, W2, W3 and biases b1, b2, b3 are defined elsewhere.

    X: Input
    p: Probability of keeping a unit active (e.g., higher p leads to less dropout)
    """
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p  # first dropout mask. Notice /p
    H1 *= U1  # Drop the activations
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = (np.random.rand(*H2.shape) < p) / p  # second dropout mask. Notice /p
    H2 *= U2  # Drop the activations
    out = np.dot(W3, H2) + b3
    return out

def predict(X):
    """ Forward pass at test time: no masking, no scaling """
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    out = np.dot(W3, H2) + b3
    return out
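
A quick numerical sanity check of the /p rescaling: the expected activation after masking matches the unmasked test-time value. The array size and keep probability here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
x = np.ones(100_000)                  # pretend activations, all equal to 1
mask = (rng.random(x.shape) < p) / p  # inverted dropout mask: entries are 0 or 1/p
dropped = x * mask
# The empirical mean of `dropped` is close to 1.0, the test-time activation,
# because E[mask] = p * (1/p) = 1.
```
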

Problem: Dropout Review

Explain why Dropout could improve performance and when we should use it

5 Weight Initialization
One common cause of poor model performance is poor weight initialization. In class, we discussed two
types of weight initialization:

1. Basic initialization: Ensure activations are reasonable and they do not grow or shrink in later layers
(for example, Gaussian random weights or Xavier initialization)
2. Advanced initialization: Work with the eigenvalues of Jacobians



Problem: Deriving Xavier Initialization

Let our activation be the tanh activation, which is approximately linear for small inputs (i.e.,
Var(a) = Var(z), where z is the output of a linear layer followed by the activation). We furthermore
assume that weights and inputs are i.i.d. and centered at zero, and that biases are initialized to zero.
We would like the magnitude of the variance to remain constant from layer to layer. Derive the Xavier
Initialization, which initializes each weight as

$$W_{ij} \sim \mathcal{N}\!\left(0, \frac{1}{D_a}\right)$$

where D_a is the dimensionality of a.
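
As an empirical sanity check of the result (not a substitute for the derivation), we can propagate unit-variance inputs through several Xavier-initialized linear layers; since tanh is approximately the identity near zero, we drop it here. The layer width and depth are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 512
a = rng.standard_normal((D, 1000))  # unit-variance inputs
for _ in range(10):
    # Xavier: each weight drawn from N(0, 1/D_a), with D_a the input dimension
    W = rng.normal(0.0, np.sqrt(1.0 / D), size=(D, D))
    a = W @ a
# Var(a) stays on the order of 1 across all ten layers,
# neither vanishing nor exploding.
```
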



6 Aside: The ReLU Activation and Its Relatives
Definition 3 (ReLU Activation). The ReLU activation is defined as ReLU(x) = max(0, x), and is a popular
activation function.

On top of the ReLU activation, there exist close relatives, such as:

• Leaky ReLU. Instead of defining the output as 0 for all x < 0, Leaky ReLU defines it as a small linear
function of x (e.g., 0.01x).
• ELU. Instead of defining the output as 0 for all x < 0, ELU defines it as α(e^x − 1) for some α.
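
The three activations can be sketched directly; the slope 0.01 for Leaky ReLU and α = 1 for ELU are common but arbitrary choices:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # small linear component alpha * x for x < 0 instead of a hard zero
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # smooth negative branch alpha * (e^x - 1) for x < 0
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2.0, -0.5, 0.0, 1.5])
# relu(x)       → [0.     0.     0.     1.5]
# leaky_relu(x) → [-0.02  -0.005  0.    1.5]
```
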

Problem: (Review) Forward and Backward Pass for ReLU

Compute the output of the forward pass of a ReLU layer with input x as given below:

y = ReLU(x)

$$x = \begin{pmatrix} 1.5 & 2.2 & 1.3 & 6.7 \\ 4.3 & -0.3 & -0.2 & 4.9 \\ -4.5 & 1.4 & 5.5 & 1.8 \\ 0.1 & -0.5 & -0.1 & 2.2 \end{pmatrix}$$

With the gradient with respect to the output, dL/dy, given below, compute the gradient of the loss with
respect to the input x using the backward pass for a ReLU layer:

$$\frac{dL}{dy} = \begin{pmatrix} 4.5 & 1.2 & 2.3 & 1.3 \\ -1.3 & -6.3 & 4.1 & -2.9 \\ -0.5 & 1.2 & 3.5 & 1.2 \\ -6.1 & 0.5 & -4.1 & -3.2 \end{pmatrix}$$

Problem: ReLU Potpourri

1. What advantages does using ReLU activations have over sigmoid activations?
2. ReLU layers have non-negative outputs. What is a negative consequence of this property? What
layer types were developed to address this issue?



7 Summary
• Recall the main ConvNet architectures (LeNet, AlexNet, GoogLeNet, VGGNet, ResNet). In particular,
recall why bottleneck layers in ResNet are important.
• Batch Normalization proceeds by first computing the empirical mean and variance, then normalizing
each activation and rescaling it through γ and β.

• Ensembles group several models into a single model. To make this quicker, we can either ensemble
only the classification layers or use snapshot ensembles.
• Dropout is a method for randomly removing nodes, and intuitively represents an ensemble of networks,
since each forward pass is effectively a different network.

