Summary. To summarize, the Conv Layer:
- Accepts a volume of size W1 × H1 × D1.
- Requires four hyperparameters: the number of filters K, their spatial extent F, the stride S, and the amount of zero padding P.
- Produces a volume of size W2 × H2 × D2, where W2 = (W1 - F + 2P)/S + 1, H2 = (H1 - F + 2P)/S + 1 (i.e. width and height are computed equally by symmetry), and D2 = K.
- With parameter sharing, it introduces F·F·D1 weights per filter, for a total of (F·F·D1)·K weights and K biases.
- In the output volume, the d-th depth slice (of size W2 × H2) is the result of performing a valid convolution of the d-th filter over the input volume with a stride of S, and then offset by the d-th bias.
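As a quick check of these formulas, here is a minimal sketch; the helper name conv_output_shape is our own illustration, not something from the notes or a library:

def conv_output_shape(W1, H1, D1, K, F, S, P):
    # Spatial size from W2 = (W1 - F + 2P)/S + 1 (and symmetrically for H2).
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K                       # one output depth slice per filter
    weights = (F * F * D1) * K   # parameter sharing: F*F*D1 weights per filter
    biases = K
    return (W2, H2, D2), weights, biases

# The demo settings further below: W1 = H1 = 5, D1 = 3, K = 2, F = 3, S = 2, P = 1
print(conv_output_shape(5, 5, 3, 2, 3, 2, 1))   # -> ((3, 3, 2), 54, 2)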
As a numpy example: suppose the input volume is a numpy array X of shape (11,11,4), the output activation volume is V, the filter size is 5, and the stride is 2. The first activation map (depth slice V[:,:,0]) would then be computed as:

V[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0
V[1,0,0] = np.sum(X[2:7,:5,:] * W0) + b0
V[2,0,0] = np.sum(X[4:9,:5,:] * W0) + b0
V[3,0,0] = np.sum(X[6:11,:5,:] * W0) + b0
Remember that in numpy, the operation * above denotes elementwise multiplication between the arrays. Notice also that W0 is the weight vector of that neuron and b0 is the bias. Here, W0 is assumed to be of shape W0.shape: (5,5,4), since the filter size is 5 and the depth of the input volume is 4. Notice that at each point, we are computing the dot product as seen before in ordinary neural networks. Also, we see that we are using the same weights and bias (due to parameter sharing), and that the indices along the width increase in steps of 2 (i.e. the stride). To construct a second activation map in the output volume, we would have:

V[0,0,1] = np.sum(X[:5,:5,:] * W1) + b1
V[1,0,1] = np.sum(X[2:7,:5,:] * W1) + b1
V[2,0,1] = np.sum(X[4:9,:5,:] * W1) + b1
V[3,0,1] = np.sum(X[6:11,:5,:] * W1) + b1
V[0,1,1] = np.sum(X[:5,2:7,:] * W1) + b1   (example of going along y)
V[2,3,1] = np.sum(X[4:9,6:11,:] * W1) + b1 (or along both)
where we see that we are indexing into the second depth dimension in V (at index 1) because we are computing the second activation map, and that a different set of parameters (W1) is now used. In the example above, for brevity we leave out some of the other operations the Conv Layer would perform to fill the other parts of the output array V. Additionally, recall that these activation maps are often passed elementwise through an activation function such as ReLU, but this is not shown here.
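Putting the pattern above into a complete loop, here is a minimal numpy sketch of this naive forward pass; the random data and the explicit triple loop are our own illustration of the indexing scheme, not a prescribed implementation:

import numpy as np

# Random stand-ins for the example's input and parameters (shapes from the text).
X = np.random.randn(11, 11, 4)         # input volume: width 11, height 11, depth 4
W0, W1 = np.random.randn(2, 5, 5, 4)   # one 5x5x4 filter per activation map
b0, b1 = np.random.randn(2)
V = np.zeros((4, 4, 2))                # output: (11 - 5)/2 + 1 = 4 along each side

for d, (W, b) in enumerate([(W0, b0), (W1, b1)]):
    for i in range(4):                 # step along the width in strides of 2
        for j in range(4):             # step along the height in strides of 2
            V[i, j, d] = np.sum(X[2*i:2*i+5, 2*j:2*j+5, :] * W) + b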
Convolution Demo. Below is a running demo of a CONV layer. Since 3D volumes are hard to
visualize, all the volumes (the input volume (in blue), the weight volumes (in red), the output
volume (in green)) are visualized with each depth slice stacked in rows. The input volume is of
size W1 = 5, H1 = 5, D1 = 3 , and the CONV layer parameters are
K = 2, F = 3, S = 2, P = 1 . That is, we have two filters of size 3 × 3 , and they are applied
with a stride of 2. Therefore, the output volume has spatial size (5 - 3 + 2)/2 + 1 = 3, i.e. a volume of size 3x3x2.
Moreover, notice that a padding of P = 1 is applied to the input volume, making the outer border
of the input volume zero. The visualization below iterates over the output activations (green), and
shows that each element is computed by elementwise multiplying the highlighted input (blue)
with the filter (red), summing it up, and then offsetting the result by the bias.
[Demo snapshot: Input Volume (+pad 1) (7x7x3) shown as depth slices x[:,:,0], x[:,:,1], x[:,:,2]; Filter W0 (3x3x3) as w0[:,:,0..2]; Filter W1 (3x3x3) as w1[:,:,0..2]; Bias b0 (1x1x1) = 1; Bias b1 (1x1x1) = 0; Output Volume (3x3x2) as o[:,:,0] and o[:,:,1].]
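As a sanity check on these shapes, here is a small numpy sketch; the random values stand in for the demo's particular numbers (only the bias values b0 = 1, b1 = 0 are taken from the snapshot), padding a 5x5x3 input with one border of zeros and convolving it with two 3x3x3 filters at stride 2:

import numpy as np

X = np.random.randint(0, 3, (5, 5, 3))      # input volume with values in {0,1,2}
Xp = np.pad(X, ((1, 1), (1, 1), (0, 0)))    # zero padding P = 1 -> 7x7x3
W = np.random.randint(-1, 2, (2, 3, 3, 3))  # K = 2 filters, each 3x3x3
b = np.array([1, 0])                        # b0 = 1, b1 = 0 as in the snapshot

out = np.zeros((3, 3, 2))                   # (5 - 3 + 2*1)/2 + 1 = 3
for d in range(2):
    for i in range(3):
        for j in range(3):
            out[i, j, d] = np.sum(Xp[2*i:2*i+3, 2*j:2*j+3, :] * W[d]) + b[d]

print(out.shape)   # (3, 3, 2), matching the demo's output volume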
Implementation as Matrix Multiplication. The convolution operation essentially performs dot products between the filters and local regions of the input. A common implementation pattern of the CONV layer takes advantage of this fact to formulate the forward pass as one big matrix multiply, as follows:
1. The local regions in the input image are stretched out into columns in an operation commonly called im2col. For example, if the input is [227x227x3] and it is to be convolved
with 11x11x3 filters at stride 4, then we would take [11x11x3] blocks of pixels in the input