
CHAPTER 9

Convolutional Neural Networks

So far, we have studied what are called fully connected neural networks, in which all of the
units at one layer are connected to all of the units in the next layer. This is a good arrangement when we don't know anything about what kind of mapping from inputs to outputs
we will be asking the network to learn to approximate. But if we do know something about
our problem, it is better to build it into the structure of our neural network. Doing so can
save computation time and significantly diminish the amount of training data required to
arrive at a solution that generalizes robustly.
One very important application domain of neural networks, where the methods have
achieved an enormous amount of success in recent years, is signal processing. Signals
might be spatial (in two-dimensional camera images or three-dimensional depth or CAT
scans) or temporal (speech or music). If we know that we are addressing a signal-processing
problem, we can take advantage of invariant properties of that problem. In this chapter, we
will focus on two-dimensional spatial problems (images) but use one-dimensional ones as
a simple example. Later, we will address temporal problems.
Imagine that you are given the problem of designing and training a neural network that takes an image as input, and outputs a classification, which is positive if the image contains a cat and negative if it does not. An image is described as a two-dimensional array of pixels (a pixel is a "picture element"), each of which may be represented by three integer values, encoding intensity levels in red, green, and blue color channels.
There are two important pieces of prior structural knowledge we can bring to bear on this problem:

• Spatial locality: The set of pixels we will have to take into consideration to find a cat will be near one another in the image. So, for example, we won't have to consider some combination of pixels in the four corners of the image in order to see if they encode cat-ness.

• Translation invariance: The pattern of pixels that characterizes a cat is the same no matter where in the image the cat occurs. Cats don't look different if they're on the left or the right side of the image.

We will design neural network structures that take advantage of these properties.


1 Filters
We begin by discussing image filters. An image filter is a function that takes in a local spatial neighborhood of pixel values and detects the presence of some pattern in that data. Unfortunately, in AI/ML/CS/Math, the word "filter" gets used in many ways: in addition to the one we describe here, it can describe a temporal process (in fact, our moving averages are a kind of filter) and even a somewhat esoteric algebraic structure.
Let's consider a very simple case to start, in which we have a one-dimensional binary "image" and a filter F of size two. The filter is a vector of two numbers, which we will move along the image, taking the dot product between the filter values and the image values at each step, and aggregating the outputs to produce a new image.
Let X be the original image, of size d; then pixel i of the output image is specified by

Y_i = F · (X_{i−1}, X_i).
To ensure that the output image is also of dimension d, we will generally "pad" the input image with 0 values if we need to access pixels that are beyond the bounds of the input image. This process of applying the filter to the image to create a new image is called "convolution." (Filters are also sometimes called convolutional kernels.)
If you are already familiar with what a convolution is, you might notice that this definition corresponds to what is often called a correlation and not to a convolution. Indeed, correlation and convolution refer to different operations in signal processing. However, in the neural-networks literature, most libraries implement the correlation (as described in this chapter) but call it convolution. The distinction is not significant; in principle, if convolution is required to solve the problem, the network could learn the necessary weights. For a discussion of the difference between convolution and correlation and the conventions used in the literature, you can read section 9.1 in this excellent book: https://www.deeplearningbook.org.
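To see the distinction concretely, here is a minimal numpy sketch (ours, not from the notes): numpy's correlate is the sliding dot product used in this chapter, while numpy's convolve flips the filter first.

```python
import numpy as np

X = np.array([0, 1, 1, 0])
F = np.array([-1, +1])

# The sliding dot product described in this chapter (a "correlation"):
print(np.correlate(X, F, mode="valid"))  # [ 1  0 -1]

# True signal-processing convolution flips the filter before sliding,
# so with this filter the output is simply negated:
print(np.convolve(X, F, mode="valid"))   # [-1  0  1]
```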
Here is a concrete example. Let the filter F1 = (−1, +1). Then given the first image
below, we can convolve it with filter F1 to obtain the second image. You can think of this
filter as a detector for “left edges” in the original image—to see this, look at the places
where there is a 1 in the output image, and see what pattern exists at that position in the
input image. Another interesting filter is F2 = (−1, +1, −1). The third image below shows
the result of convolving the first image with F2 .
Study Question: Convince yourself that filter F2 can be understood as a detector for
isolated positive pixels in the binary image.

Image: 0 0 1 1 1 0 1 0 0 0

F1 : -1 +1

After convolution (w/ F1 ): 0 1 0 0 -1 1 -1 0 0

F2 : -1 +1 -1

After convolution (w/ F2 ): -1 0 -1 0 -2 1 -1 0
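To make the process concrete, here is a minimal numpy sketch (ours, not from the notes) that slides a filter along a zero-padded image, as described above. With padding, the rows shown above reappear with extra boundary zeros, since this version keeps the output at length d:

```python
import numpy as np

def filter1d(image, filt):
    """Slide `filt` along `image`, zero-padding so the output has the
    same length d as the input (as described above)."""
    k = len(filt)
    pad = k // 2
    padded = np.concatenate([np.zeros(pad), image, np.zeros(pad)])
    return np.array([np.dot(filt, padded[i:i + k])
                     for i in range(len(image))])

X  = np.array([0, 0, 1, 1, 1, 0, 1, 0, 0, 0])
F1 = np.array([-1, +1])
F2 = np.array([-1, +1, -1])

print(filter1d(X, F1))  # left-edge detector
print(filter1d(X, F2))  # isolated-positive-pixel detector
```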


Two-dimensional versions of filters like these are thought to be found in the visual
cortex of all mammalian brains. Similar patterns arise from statistical analysis of natural images. Computer vision people used to spend a lot of time hand-designing filter banks. A
filter bank is a set of sets of filters, arranged as shown in the diagram below.

[Diagram: a filter bank, with successive groups of filters applied first to the image and then to the resulting channels.]

All of the filters in the first group are applied to the original image; if there are k such filters, then the result is k new images, which are called channels. Now imagine stacking all these new images up so that we have a cube of data, indexed by the original row and column indices of the image, as well as by the channel. The next set of filters in the filter bank will generally be three-dimensional: each one will be applied to a sub-range of the row and column indices of the image and to all of the channels.
These 3D chunks of data are called tensors. The algebra of tensors is fun, and a lot like matrix algebra, but we won't go into it in any detail. (We will use a popular piece of neural-network software called TensorFlow because it makes operations on tensors easy.)
Here is a more complex example of two-dimensional filtering. We have two 3 × 3 filters in the first layer, f1 and f2. You can think of each one as "looking" for three pixels in a row, f1 vertically and f2 horizontally. Assuming our input image is n × n, then the result of filtering with these two filters is an n × n × 2 tensor. Now we apply a tensor filter (hard to draw!) that "looks for" a combination of two horizontal and two vertical bars (now represented by individual pixels in the two channels), resulting in a single final n × n image. (When we have a color image as input, we treat it as having 3 channels, and hence as an n × n × 3 tensor.)

[Diagram: filters f1 and f2 each produce a channel; a tensor filter then combines the two channels into a single output image.]

We are going to design neural networks that have this structure. Each “bank” of the
filter bank will correspond to a neural-network layer. The numbers in the individual fil-
ters will be the “weights” (plus a single additive bias or offset value for each filter) of the
network, which we will train using gradient descent. What makes this interesting and
powerful (and somewhat confusing at first) is that the same weights are used many, many
times in the computation of each layer. This weight sharing means that we can express a
transformation on a large image with relatively few parameters; it also means we’ll have
to take care in figuring out exactly how to train it!


We will define a filter layer l formally with: (For simplicity, we are assuming that all images and filters are square, having the same number of rows and columns. That is in no way necessary, but is usually fine and definitely simplifies our notation.)

• number of filters m_l;

• size of one filter is k_l × k_l × m_{l−1} plus 1 bias value (for this one filter);

• stride s_l is the spacing at which we apply the filter to the image; in all of our examples so far, we have used a stride of 1, but if we were to "skip" and apply the filter only at odd-numbered indices of the image, then it would have a stride of two (and produce a resulting image of half the size);

• input tensor size n_{l−1} × n_{l−1} × m_{l−1};

• padding: p_l is how many extra pixels (typically with value 0) we add around the edges of the input. For an input of size n_{l−1} × n_{l−1} × m_{l−1}, our new effective input size with padding becomes (n_{l−1} + 2p_l) × (n_{l−1} + 2p_l) × m_{l−1}.

This layer will produce an output tensor of size n_l × n_l × m_l, where

n_l = ⌈(n_{l−1} + 2p_l − (k_l − 1)) / s_l⌉.¹

The weights are the values defining the filter: there will be m_l different k_l × k_l × m_{l−1} tensors of weight values; plus each filter may have a bias term, which means there is one more weight value per filter. A filter with a bias operates just like the filter examples above, except we add the bias to the output. For instance, if we incorporated a bias term of 0.5 into the filter F2 above, the output would be (−0.5, 0.5, −0.5, 0.5, −1.5, 1.5, −0.5, 0.5) instead of (−1, 0, −1, 0, −2, 1, −1, 0).
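Here is a small numerical check of the output-size formula and the bias example (a sketch of ours; `output_size` is a hypothetical helper, not from the notes):

```python
import math
import numpy as np

def output_size(n_prev, k, stride, pad):
    # n_l = ceil((n_{l-1} + 2*p_l - (k_l - 1)) / s_l)
    return math.ceil((n_prev + 2 * pad - (k - 1)) / stride)

print(output_size(10, 3, 1, 0))  # 8, matching the unpadded F2 example above
print(output_size(10, 3, 1, 1))  # 10, padding preserves the input size

# The bias example: filtering with F2 = (-1, +1, -1) plus a bias of 0.5.
X = np.array([0, 0, 1, 1, 1, 0, 1, 0, 0, 0])
F2 = np.array([-1, +1, -1])
print(np.correlate(X, F2, mode="valid") + 0.5)
# [-0.5  0.5 -0.5  0.5 -1.5  1.5 -0.5  0.5]
```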
This may seem complicated, but we get a rich class of mappings that exploit image
structure and have many fewer weights than a fully connected layer would.
Study Question: How many weights are in a convolutional layer specified as
above?

Study Question: If we used a fully-connected layer with the same size inputs and
outputs, how many weights would it have?

2 Max Pooling
It is typical, both in engineering and in nature, to structure filter banks into a pyramid, in which the image sizes get smaller in successive layers of processing. The idea is that we find local patterns, like bits of edges, in the early layers, and then look for patterns in those patterns, etc. This means that, effectively, we are looking for patterns in larger pieces of the image as we apply successive filters. Having a stride greater than one makes the images smaller, but does not necessarily aggregate information over that spatial range.
Another common layer type, which accomplishes this aggregation, is max pooling. A max pooling layer operates like a filter, but has no weights. You can think of it as a pure functional layer, like a ReLU layer in a fully connected network. It has a filter size, as in a filter layer, but simply returns the maximum value in its field. (We sometimes use the term receptive field, or just field, to mean the area of an input image that a filter is being applied to.) Usually, we apply max pooling with the following traits:

• stride > 1, so that the resulting image is smaller than the input image; and

• k ≥ stride, so that the whole image is covered.

¹ Here, ⌈·⌉ is known as the ceiling function; it returns the smallest integer greater than or equal to its input. E.g., ⌈2.5⌉ = 3 and ⌈3⌉ = 3.


As a result of applying a max pooling layer, we don’t keep track of the precise location of a
pattern. This helps our filters to learn to recognize patterns independent of their location.
Consider a max pooling layer of stride = k = 2. This would map a 64 × 64 × 3 image
to a 32 × 32 × 3 image. Note that max pooling layers do not have additional bias or offset
values.
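A minimal sketch of max pooling (ours, not from the notes), using the stride = k = 2 setting of the example:

```python
import numpy as np

def max_pool(image, k=2, stride=2):
    """Max pooling on an n x n x m tensor: take the max over each
    k x k window, channel by channel. No weights, no bias."""
    n, _, m = image.shape
    out = (n - k) // stride + 1
    pooled = np.zeros((out, out, m))
    for i in range(out):
        for j in range(out):
            window = image[i*stride:i*stride+k, j*stride:j*stride+k, :]
            pooled[i, j, :] = window.max(axis=(0, 1))
    return pooled

print(max_pool(np.random.rand(64, 64, 3)).shape)  # (32, 32, 3)
```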
Study Question: Maximilian Poole thinks it would be a good idea to add two max
pooling layers of size k, one right after the other, to their network. What single layer
would be equivalent?

3 Typical architecture
Here is the form of a typical convolutional network:

[Diagram of a typical convolutional network; source: https://www.mathworks.com/solutions/deep-learning/convolutional-neural-network.html]

After each filter layer there is generally a ReLU layer; there may be multiple filter/ReLU layers, then a max pooling layer, then some more filter/ReLU layers, then max pooling. Once the output is down to a relatively small size, there is typically a last fully-connected layer, leading into an activation function such as softmax that produces the final output. The exact design of these structures is an art: there is not currently any clear theoretical (or even systematic empirical) understanding of how these various design choices affect overall performance of the network.
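For concreteness, here is one way such a stack might be written in the Keras API of TensorFlow, which these notes mention; the layer counts and sizes below are arbitrary illustrative choices of ours, not prescribed by the text:

```python
import tensorflow as tf

# A minimal sketch of the pattern described above: filter/ReLU layers,
# then max pooling, repeated; then a fully-connected layer into softmax.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, kernel_size=3, activation="relu",
                           padding="same", input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Conv2D(32, kernel_size=3, activation="relu",
                           padding="same"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()
```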
The critical point for us is that this is all just a big neural network, which takes an input and computes an output. The mapping is a differentiable function of the weights (well, the derivative is not continuous, both because of the ReLU and the max pooling operations, but we ignore that fact), which means we can adjust the weights to decrease the loss by performing gradient descent, and we can compute the relevant gradients using back-propagation!
Let's work through a very simple example of how back-propagation can work on a convolutional network. The architecture is shown below. Assume we have a one-dimensional single-channel image of size n × 1 × 1 and a single k × 1 × 1 filter (where we omit the filter bias) in the first convolutional layer. Then we pass it through a ReLU layer and a fully-connected layer with no additional activation function on the output.


[Diagram: conv → ReLU → fc. The input X = A^0 is padded with 0's (to get an output of the same shape), filtered with W^1 to produce Z^1, passed through a ReLU to produce A^1, and then through a fully-connected layer to produce Z^2 = A^2.]

For simplicity, assume k is odd, let the input image X = A^0, and assume we are using squared loss. Then we can describe the forward pass as follows:

Z^1_i = (W^1)^T · A^0_{[i−⌊k/2⌋ : i+⌊k/2⌋]}
A^1 = ReLU(Z^1)
A^2 = (W^2)^T A^1
L(A^2, y) = (A^2 − y)^2
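As a sanity check, here is a minimal numpy sketch of this forward pass (our own illustration, not part of the original notes; the function names are ours):

```python
import numpy as np

def forward(X, W1, W2):
    """Forward pass for the tiny network above: 1-D convolution with
    filter W1 (zero-padded to keep size n), ReLU, then fully connected."""
    n, k = len(X), len(W1)
    pad = k // 2                      # k is assumed odd
    A0 = np.concatenate([np.zeros(pad), X, np.zeros(pad)])
    Z1 = np.array([W1 @ A0[i:i + k] for i in range(n)])
    A1 = np.maximum(Z1, 0)            # ReLU
    A2 = W2 @ A1                      # fully-connected output (a scalar)
    return Z1, A1, A2

def loss(A2, y):
    return (A2 - y) ** 2
```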

Study Question: For a filter of size k, how much padding do we need to add to the
top and bottom of the image?
How do we update the weights in filter W^1?

∂loss/∂W^1 = ∂Z^1/∂W^1 · ∂A^1/∂Z^1 · ∂loss/∂A^1
• ∂Z^1/∂W^1 is the k × n matrix such that ∂Z^1_i/∂W^1_j = X_{i−⌊k/2⌋+j−1}. So, for example, if i = 10 and k = 5, then column 10 of this matrix, which illustrates the dependence of pixel 10 of the output image on the weights, will contain the elements X_8, X_9, X_{10}, X_{11}, X_{12}.

• ∂A^1/∂Z^1 is the n × n diagonal matrix such that ∂A^1_i/∂Z^1_i = 1 if Z^1_i > 0, and 0 otherwise.

• ∂loss/∂A^1 = ∂loss/∂A^2 · ∂A^2/∂A^1 = 2(A^2 − y)W^2, an n × 1 vector.

Multiplying these components yields the desired gradient, of shape k × 1.
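And here is a matching sketch of the gradient computation, assembled from the three factors above. This continues the hypothetical `forward` sketch from the previous code block, so it assumes numpy is imported and `forward` is defined:

```python
def grad_W1(X, W1, W2, y):
    """Gradient of the squared loss with respect to the filter W1,
    assembled from the three factors described above."""
    n, k = len(X), len(W1)
    pad = k // 2
    A0 = np.concatenate([np.zeros(pad), X, np.zeros(pad)])
    Z1, A1, A2 = forward(X, W1, W2)

    # dZ1/dW1: k x n matrix; column i holds the input window for output i.
    dZ1_dW1 = np.stack([A0[i:i + k] for i in range(n)], axis=1)
    # dA1/dZ1: n x n diagonal matrix of ReLU indicators.
    dA1_dZ1 = np.diag((Z1 > 0).astype(float))
    # dloss/dA1: the n-vector 2 (A2 - y) W2.
    dloss_dA1 = 2 * (A2 - y) * W2

    return dZ1_dW1 @ dA1_dZ1 @ dloss_dA1   # shape (k,), as stated above
```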
