Convolutional Neural Networks
So far, we have studied what are called fully connected neural networks, in which all of the units at one layer are connected to all of the units in the next layer. This is a good arrangement when we don't know anything about what kind of mapping from inputs to outputs we will be asking the network to learn to approximate. But if we do know something about our problem, it is better to build it into the structure of our neural network. Doing so can save computation time and significantly diminish the amount of training data required to arrive at a solution that generalizes robustly.
One very important application domain of neural networks, where the methods have
achieved an enormous amount of success in recent years, is signal processing. Signals
might be spatial (in two-dimensional camera images or three-dimensional depth or CAT
scans) or temporal (speech or music). If we know that we are addressing a signal-processing
problem, we can take advantage of invariant properties of that problem. In this chapter, we
will focus on two-dimensional spatial problems (images) but use one-dimensional ones as
a simple example. Later, we will address temporal problems.
Imagine that you are given the problem of designing and training a neural network that takes an image as input, and outputs a classification, which is positive if the image contains a cat and negative if it does not. An image is described as a two-dimensional array of pixels, each of which may be represented by three integer values, encoding intensity levels in red, green, and blue color channels. (A pixel is a "picture element.")

There are two important pieces of prior structural knowledge we can bring to bear on this problem:

• Spatial locality: The set of pixels we will have to take into consideration to find a cat will be near one another in the image. So, for example, we won't have to consider some combination of pixels in the four corners of the image, in order to see if they encode cat-ness.

• Translation invariance: The pattern of pixels that characterizes a cat is the same no matter where in the image the cat occurs.

We will design neural network structures that take advantage of these properties.
1 Filters
We begin by discussing image filters. An image filter is a function that takes in a local spatial neighborhood of pixel values and detects the presence of some pattern in that data. (Unfortunately, in AI/ML/CS/Math, the word "filter" gets used in many ways: in addition to the one we describe here, it can describe a temporal process (in fact, our moving averages are a kind of filter) and even a somewhat esoteric algebraic structure.)

Let's consider a very simple case to start, in which we have a 1-dimensional binary "image" and a filter F of size two. The filter is a vector of two numbers, which we will move along the image, taking the dot product between the filter values and the image values at each step, and aggregating the outputs to produce a new image.

Let X be the original image, of size d; then pixel i of the output image is specified by
$$Y_i = F \cdot (X_{i-1}, X_i)\,.$$
To ensure that the output image is also of dimension d, we will generally "pad" the input image with 0 values if we need to access pixels that are beyond the bounds of the input image. This process of applying the filter to the image to create a new image is called "convolution." (Filters are also sometimes called convolutional kernels.)

If you are already familiar with what a convolution is, you might notice that this definition corresponds to what is often called a correlation and not to a convolution. Indeed, correlation and convolution refer to different operations in signal processing. However, in the neural networks literature, most libraries implement the correlation (as described in this chapter) but call it convolution. The distinction is not significant; in principle, if convolution is required to solve the problem, the network could learn the necessary weights. For a discussion of the difference between convolution and correlation and the conventions used in the literature, you can read Section 9.1 of this excellent book: https://www.deeplearningbook.org.
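To see the distinction concretely, here is a small numpy illustration (our own, not from the notes): true convolution flips the filter before sliding it along the image, while correlation does not.

```python
import numpy as np

X = np.array([0, 0, 1, 1, 1, 0, 1, 0, 0, 0])
F = np.array([-1, 0, 1])

# np.convolve flips the filter before sliding; np.correlate does not.
print(np.convolve(X, F, mode="same"))        # convolution
print(np.correlate(X, F, mode="same"))       # correlation
# Correlating with the flipped filter recovers the convolution
# (exact for odd-length filters like this one).
print(np.correlate(X, F[::-1], mode="same"))
```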
Here is a concrete example. Let the filter F1 = (−1, +1). Then given the first image below, we can convolve it with filter F1 to obtain the second image. You can think of this filter as a detector for "left edges" in the original image—to see this, look at the places where there is a 1 in the output image, and see what pattern exists at that position in the input image. Another interesting filter is F2 = (−1, +1, −1). The third image below shows the result of convolving the first image with F2.
Study Question: Convince yourself that filter F2 can be understood as a detector for
isolated positive pixels in the binary image.
Image:               0  0  1  1  1  0  1  0  0  0
F1:                 -1 +1
After conv with F1:  0  0  1  0  0 -1  1 -1  0  0
F2:                 -1 +1 -1
After conv with F2:  0 -1  0 -1  0 -2  1 -1  0  0
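As a sanity check, here is a minimal Python sketch (our own code, not part of the notes) of the padded "convolution" defined above, applied to this example:

```python
import numpy as np

def conv1d(image, filt):
    """Slide `filt` along `image`, taking a dot product at each step.
    The image is zero-padded so the output has the same length as the
    input; for a size-2 filter this computes Y_i = F . (X_{i-1}, X_i),
    and odd-size filters are centered on pixel i."""
    k = len(filt)
    left, right = k // 2, (k - 1) // 2             # zeros on each side
    padded = np.pad(image, (left, right))          # pad with 0s
    return np.array([np.dot(filt, padded[i:i + k])
                     for i in range(len(image))])

X  = np.array([0, 0, 1, 1, 1, 0, 1, 0, 0, 0])
F1 = np.array([-1, +1])
F2 = np.array([-1, +1, -1])
print(conv1d(X, F1))  # [ 0  0  1  0  0 -1  1 -1  0  0]
print(conv1d(X, F2))  # [ 0 -1  0 -1  0 -2  1 -1  0  0]
```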
Computer vision people used to spend a lot of time hand-designing filter banks. A filter bank is a set of sets of filters, arranged as shown in the diagram below.
(Diagram: a filter bank; each group of filters is applied to the image, or to the stack of channels produced by the previous group.)
All of the filters in the first group are applied to the original image; if there are k such
filters, then the result is k new images, which are called channels. Now imagine stacking
all these new images up so that we have a cube of data, indexed by the original row and
column indices of the image, as well as by the channel. The next set of filters in the filter
bank will generally be three-dimensional: each one will be applied to a sub-range of the row
and column indices of the image and to all of the channels.
These 3D chunks of data are called tensors. The algebra of tensors is fun, and a lot like matrix algebra, but we won't go into it in any detail. (We will use a popular piece of neural-network software called TensorFlow because it makes operations on tensors easy.)

Here is a more complex example of two-dimensional filtering. We have two 3 × 3 filters in the first layer, f1 and f2. You can think of each one as "looking" for three pixels in a row, f1 vertically and f2 horizontally. Assuming our input image is n × n, then the result of filtering with these two filters is an n × n × 2 tensor. Now we apply a tensor filter (hard to draw!) that "looks for" a combination of two horizontal and two vertical bars (now represented by individual pixels in the two channels), resulting in a single final n × n image. (When we have a color image as input, we treat it as having 3 channels, and hence as an n × n × 3 tensor.)

(Figure: the image filtered by f1 and f2 to produce two channels, followed by a tensor filter applied across both channels.)
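Here is a hedged numpy sketch of this two-stage computation (illustrative only; the bar-detector filters below are our guess at what f1 and f2 might look like, and `conv2d` is our own helper):

```python
import numpy as np

def conv2d(image, filt):
    """Correlate a (k x k x m) filter with an (n x n x m) tensor,
    zero-padding rows and columns so the output is n x n."""
    n, k = image.shape[0], filt.shape[0]
    p = k // 2
    padded = np.pad(image, ((p, p), (p, p), (0, 0)))
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * filt)
    return out

n = 8
image = np.random.rand(n, n, 1)              # single-channel input
f1 = np.zeros((3, 3, 1)); f1[:, 1, 0] = 1.0  # vertical-bar detector (assumed)
f2 = np.zeros((3, 3, 1)); f2[1, :, 0] = 1.0  # horizontal-bar detector (assumed)
channels = np.stack([conv2d(image, f1),
                     conv2d(image, f2)], axis=-1)   # n x n x 2 tensor
tensor_filter = np.random.rand(3, 3, 2)      # one filter over both channels
final = conv2d(channels, tensor_filter)      # single final n x n image
print(channels.shape, final.shape)           # (8, 8, 2) (8, 8)
```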
We are going to design neural networks that have this structure. Each "bank" of the filter bank will correspond to a neural-network layer. The numbers in the individual filters will be the "weights" (plus a single additive bias or offset value for each filter) of the network, which we will train using gradient descent. What makes this interesting and powerful (and somewhat confusing at first) is that the same weights are used many, many times in the computation of each layer. This weight sharing means that we can express a transformation on a large image with relatively few parameters; it also means we'll have to take care in figuring out exactly how to train it!
One hyperparameter of a convolutional layer l is its padding:

• padding: p_l is how many extra pixels, typically with value 0, we add around the edges of the input. For an input of size n_{l−1} × n_{l−1} × m_{l−1}, our new effective input size with padding becomes (n_{l−1} + 2p_l) × (n_{l−1} + 2p_l) × m_{l−1}.
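For example, with p_l = 2, a 64 × 64 × 3 input becomes a 68 × 68 × 3 effective input.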
Study Question: If we used a fully-connected layer with the same size inputs and
outputs, how many weights would it have?
2 Max Pooling
It is typical, both in engineering and in nature, to structure filter banks into a pyramid, in which the image sizes get smaller in successive layers of processing. The idea is that we find local patterns, like bits of edges, in the early layers, and then look for patterns in those patterns, etc. This means that, effectively, we are looking for patterns in larger pieces of the image as we apply successive filters. Having a stride greater than one makes the images smaller, but does not necessarily aggregate information over that spatial range.

Another common layer type, which accomplishes this aggregation, is max pooling. A max pooling layer operates like a filter, but has no weights. You can think of it as a pure functional layer, like a ReLU layer in a fully connected network. It has a filter size, as in a filter layer, but simply returns the maximum value in its field. (We sometimes use the term receptive field, or just field, to mean the area of an input image that a filter is being applied to.) Usually, we apply max pooling with the following traits:

• stride > 1, so that the resulting image is smaller than the input image (a stride of s maps an n × n input to one of size ⌈n/s⌉ × ⌈n/s⌉, where ⌈·⌉ is the ceiling function: it returns the smallest integer greater than or equal to its input); and

• k > stride, so that the whole image is covered.

As a result of applying a max pooling layer, we don't keep track of the precise location of a pattern. This helps our filters to learn to recognize patterns independent of their location.
Consider a max pooling layer of stride = k = 2. This would map a 64 × 64 × 3 image
to a 32 × 32 × 3 image. Note that max pooling layers do not have additional bias or offset
values.
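A minimal numpy sketch of such a layer (our own illustration; this simple version does not pad, so it assumes the windows fit within the image):

```python
import numpy as np

def max_pool(image, k, stride):
    """Max pooling over k x k windows taken every `stride` pixels,
    applied independently to each channel of an n x n x m tensor.
    There are no weights and no bias, just a max over each field."""
    n, m = image.shape[0], image.shape[2]
    out_n = (n - k) // stride + 1
    out = np.empty((out_n, out_n, m))
    for i in range(out_n):
        for j in range(out_n):
            r, c = i * stride, j * stride
            out[i, j] = image[r:r + k, c:c + k].max(axis=(0, 1))
    return out

image = np.random.rand(64, 64, 3)
print(max_pool(image, k=2, stride=2).shape)  # (32, 32, 3)
```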
Study Question: Maximilian Poole thinks it would be a good idea to add two max
pooling layers of size k, one right after the other, to their network. What single layer
would be equivalent?
3 Typical architecture
Here is the form of a typical convolutional network:
(Figure: a typical convolutional network architecture. Source: https://www.mathworks.com/solutions/deep-learning/convolutional-neural-network.html)
After each filter layer there is generally a ReLU layer; there may be multiple filter/ReLU layers, then a max pooling layer, then some more filter/ReLU layers, then max pooling. Once the output is down to a relatively small size, there is typically a last fully-connected layer, leading into an activation function such as softmax that produces the final output. The exact design of these structures is an art: there is not currently any clear theoretical (or even systematic empirical) understanding of how these various design choices affect overall performance of the network.
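Since these notes use TensorFlow, here is one way such a stack might look in Keras. This is a sketch of the pattern just described, with arbitrary layer sizes, not a recommended design:

```python
import tensorflow as tf
from tensorflow.keras import layers

# filter/ReLU -> max pool -> filter/ReLU -> max pool -> fully connected
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),                        # n x n x 3 color image
    layers.Conv2D(16, 3, padding="same", activation="relu"),  # filter layer + ReLU
    layers.MaxPooling2D(pool_size=2),                         # max pooling, stride 2
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),                   # final fully-connected layer
])
model.summary()
```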
The critical point for us is that this is all just a big neural network, which takes an input and computes an output. The mapping is a differentiable function of the weights, which means we can adjust the weights to decrease the loss by performing gradient descent, and we can compute the relevant gradients using back-propagation! (Well, the derivative is not continuous, both because of the ReLU and the max pooling operations, but we ignore that fact.)

Let's work through a very simple example of how back-propagation can work on a convolutional network. The architecture is shown below. Assume we have a one-dimensional single-channel image, of size n × 1 × 1, and a single k × 1 × 1 filter (where we omit the filter bias) in the first convolutional layer. Then we pass it through a ReLU layer and a fully-connected layer with no additional activation function on the output.
(Figure: the architecture for this example. The input X = A^0 passes through a convolutional layer with weights W^1, then a ReLU, then a fully-connected layer with weights W^2, producing the output Z^2 = A^2.)
For simplicity assume k is odd, let the input image X = A^0, and assume we are using squared loss. Then we can describe the forward pass as follows:

$$Z^1_i = {W^1}^T \cdot A^0_{[i-\lfloor k/2 \rfloor \,:\, i+\lfloor k/2 \rfloor]}$$
$$A^1 = \text{ReLU}(Z^1)$$
$$A^2 = {W^2}^T A^1$$
$$\mathcal{L}(A^2, y) = (A^2 - y)^2$$
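A minimal numpy sketch of this forward pass (our own illustrative code; it uses the zero-padding convention from Section 1):

```python
import numpy as np

def forward(X, W1, W2, y):
    """Forward pass: 1-D conv (filter W1, no bias), ReLU, then a
    fully-connected layer W2, with squared loss against target y."""
    n, k = len(X), len(W1)
    half = k // 2                              # k assumed odd
    A0 = np.pad(X, (half, half))               # zero-pad the input
    Z1 = np.array([W1 @ A0[i:i + k] for i in range(n)])
    A1 = np.maximum(Z1, 0)                     # ReLU
    A2 = W2 @ A1                               # fully-connected output
    return (A2 - y) ** 2                       # squared loss

X  = np.array([0., 1., 1., 0., 1.])
W1 = np.array([-1., 1., -1.])                  # a k = 3 filter
W2 = np.ones(5)
print(forward(X, W1, W2, y=1.0))
```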
Study Question: For a filter of size k, how much padding do we need to add to the
top and bottom of the image?
How do we update the weights in filter $W^1$?