9.CNN-1
Convolutional Neural Networks
Fully Connected Neural Network
[Figure: a fully connected network, annotated with its width and depth]
Problems using FC Layers on Images
• How to process even a tiny image with FC layers?
• One neuron connected to a single row of 5 pixels: 5 weights.
• For the whole 5 × 5 image on 1 channel: 25 weights per neuron of a 3-neuron layer.
• For the whole 5 × 5 image on 3 channels: 75 weights per neuron of a 3-neuron layer.
• For a realistic 1000 × 1000 image with 3 channels feeding a 1000-neuron layer: 3 million weights per neuron, i.e. 3 billion weights in total.
[Li et al., CS231n Course Slides] Lecture 12: Detection and Segmentation
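To make these parameter counts concrete, here is a small Python sketch (not from the lecture) that computes how many weights a fully connected layer needs when an image is flattened into its input:

```python
def fc_weights(height, width, channels, num_neurons):
    """Number of weights of a fully connected layer on a flattened image (biases ignored)."""
    return height * width * channels * num_neurons

print(fc_weights(5, 5, 1, 3))           # 75            (25 weights per neuron)
print(fc_weights(5, 5, 3, 3))           # 225           (75 weights per neuron)
print(fc_weights(1000, 1000, 3, 1000))  # 3,000,000,000 weights
```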
Convolutions
Continuous definition:
(f ∗ g)(t) = ∫_{−∞}^{+∞} f(τ) g(t − τ) dτ
[Figure: f in red, g in blue, f ∗ g in green]
What are Convolutions?
Discrete case: box filter
Signal: 4 3 2 -5 3 5 2 5 5 6
Filter: 1/3 1/3 1/3
Slide the filter over the signal and take the weighted sum of the values it covers:
4 ⋅ 1/3 + 3 ⋅ 1/3 + 2 ⋅ 1/3 = 3
3 ⋅ 1/3 + 2 ⋅ 1/3 + (−5) ⋅ 1/3 = 0
2 ⋅ 1/3 + (−5) ⋅ 1/3 + 3 ⋅ 1/3 = 0
(−5) ⋅ 1/3 + 3 ⋅ 1/3 + 5 ⋅ 1/3 = 1
3 ⋅ 1/3 + 5 ⋅ 1/3 + 2 ⋅ 1/3 = 10/3
5 ⋅ 1/3 + 2 ⋅ 1/3 + 5 ⋅ 1/3 = 4
2 ⋅ 1/3 + 5 ⋅ 1/3 + 5 ⋅ 1/3 = 4
5 ⋅ 1/3 + 5 ⋅ 1/3 + 6 ⋅ 1/3 = 16/3
f ∗ g: ?? 3 0 0 1 10/3 4 4 16/3 ??
What to do at boundaries?
Option 1: Shrink. Only keep the positions where the filter fits entirely inside the signal:
f ∗ g: 3 0 0 1 10/3 4 4 16/3
Option 2: Pad the signal, often with 0's:
0 4 3 2 -5 3 5 2 5 5 6 0
Now the boundary outputs can be computed too, e.g. the first one:
0 ⋅ 1/3 + 4 ⋅ 1/3 + 3 ⋅ 1/3 = 7/3
and the output keeps the same length as the input.
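This behavior can be checked with a small NumPy sketch (mine, not lecture code); np.convolve flips the filter, which makes no difference here because the box filter is symmetric:

```python
import numpy as np

f = np.array([4, 3, 2, -5, 3, 5, 2, 5, 5, 6], dtype=float)  # signal
g = np.array([1/3, 1/3, 1/3])                                # box filter

# Option 1 (shrink): only positions where the filter fully overlaps the signal
print(np.convolve(f, g, mode='valid'))
# [3. 0. 0. 1. 3.3333 4. 4. 5.3333]

# Option 2 (zero padding): the output keeps the length of the input
print(np.convolve(f, g, mode='same'))
# first entry: (0 + 4 + 3) / 3 = 2.3333
```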
The same idea works in 2D, e.g. on images: slide a 3 × 3 kernel over a 5 × 5 input (stride 1, no padding) to get a 3 × 3 output.

Kernel 3 × 3:
 0 -1  0
-1  5 -1
 0 -1  0

Input 5 × 5 (entries marked · do not influence the output, since the kernel's corner weights are 0):
 ·  3  2 -5  ·
 4  3  2  1 -3
 1  0  3  3  5
-2  0  1  4  4
 5  6  7  9 -1

Each output value is the sum of elementwise products between the kernel and the 3 × 3 patch of the input it currently covers:
5 ⋅ 3 + (−1) ⋅ 3 + (−1) ⋅ 2 + (−1) ⋅ 0 + (−1) ⋅ 4 = 15 − 9 = 6
5 ⋅ 2 + (−1) ⋅ 2 + (−1) ⋅ 1 + (−1) ⋅ 3 + (−1) ⋅ 3 = 10 − 9 = 1
5 ⋅ 1 + (−1) ⋅ (−5) + (−1) ⋅ (−3) + (−1) ⋅ 3 + (−1) ⋅ 2 = 5 + 3 = 8
5 ⋅ 0 + (−1) ⋅ 3 + (−1) ⋅ 0 + (−1) ⋅ 1 + (−1) ⋅ 3 = 0 − 7 = −7
5 ⋅ 3 + (−1) ⋅ 2 + (−1) ⋅ 3 + (−1) ⋅ 1 + (−1) ⋅ 0 = 15 − 6 = 9
5 ⋅ 3 + (−1) ⋅ 1 + (−1) ⋅ 5 + (−1) ⋅ 4 + (−1) ⋅ 3 = 15 − 13 = 2
5 ⋅ 0 + (−1) ⋅ 0 + (−1) ⋅ 1 + (−1) ⋅ 6 + (−1) ⋅ (−2) = −5
5 ⋅ 1 + (−1) ⋅ 3 + (−1) ⋅ 4 + (−1) ⋅ 7 + (−1) ⋅ 0 = 5 − 14 = −9
5 ⋅ 4 + (−1) ⋅ 3 + (−1) ⋅ 4 + (−1) ⋅ 9 + (−1) ⋅ 1 = 20 − 17 = 3

Output 3 × 3:
 6  1  8
-7  9  2
-5 -9  3
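A minimal NumPy sketch (my own, not lecture code) that reproduces this output; the two unknown corner entries are set to 0 here, which does not change the result for this kernel:

```python
import numpy as np

image = np.array([
    [ 0,  3,  2, -5,  0],   # the two corner values are placeholders (weight 0 in the kernel)
    [ 4,  3,  2,  1, -3],
    [ 1,  0,  3,  3,  5],
    [-2,  0,  1,  4,  4],
    [ 5,  6,  7,  9, -1],
], dtype=float)

kernel = np.array([
    [ 0, -1,  0],
    [-1,  5, -1],
    [ 0, -1,  0],
], dtype=float)

# Slide the 3x3 kernel over every 3x3 patch (stride 1, no padding).
out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)

print(out)
# [[ 6.  1.  8.]
#  [-7.  9.  2.]
#  [-5. -9.  3.]]
```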
Convolutions on RGB Images
Images have depth: e.g. RGB → 3 channels. A 32 × 32 RGB image is therefore a 32 × 32 × 3 volume of pixels 𝒙, and a filter extends over the full depth, here 5 × 5 × 3.
Such a filter produces one number at a time: the dot product between the filter weights 𝒘 and the i-th 5 × 5 × 3 chunk 𝒙ᵢ of the image, i.e. a 5 ⋅ 5 ⋅ 3 = 75-dimensional dot product, plus a bias:
zᵢ = 𝒘ᵀ𝒙ᵢ + b
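As a small sketch of this single step (the random example values for the image, filter, and bias are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))   # 32 x 32 RGB image
w = rng.standard_normal((5, 5, 3))         # one 5 x 5 x 3 filter
b = 0.1                                    # bias

# One output value: a 75-dimensional dot product between the flattened
# filter weights and one 5 x 5 x 3 chunk of the image, plus the bias.
x_i = image[0:5, 0:5, :]                   # chunk at the top-left corner
z_i = w.ravel() @ x_i.ravel() + b
print(z_i)
```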
Convolving the 5 × 5 × 3 filter with the 32 × 32 × 3 image gives an activation map (also called a feature map): slide the filter over all spatial locations 𝒙ᵢ and compute the output zᵢ at each one. Without padding there are 28 × 28 locations, so the activation map is 28 × 28 × 1.
Convolution Layer
Let's apply a different 5 × 5 × 3 filter with different weights! Convolving it with the same image yields a second 28 × 28 × 1 activation map.
A convolution "layer" is a set of such filters: each filter is convolved with the 32 × 32 × 3 image and produces its own 28 × 28 activation map, and the maps are stacked to form the layer's output.
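A minimal sketch of such a layer using PyTorch's nn.Conv2d; the choice of 6 filters is mine, only the input and filter sizes come from the slides:

```python
import torch
import torch.nn as nn

# PyTorch uses the (batch, channels, height, width) layout.
x = torch.randn(1, 3, 32, 32)                                   # one 32 x 32 RGB image

conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)  # 6 filters of size 5 x 5 x 3
activation_maps = conv(x)

print(activation_maps.shape)  # torch.Size([1, 6, 28, 28]): one 28 x 28 map per filter
```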
Convolution Layers: Dimensions
Input height/width: N
Filter height/width: F
Stride: S
Output size: ((N − F)/S + 1) × ((N − F)/S + 1)

N = 7, F = 3, S = 1: (7 − 3)/1 + 1 = 5
N = 7, F = 3, S = 2: (7 − 3)/2 + 1 = 3
N = 7, F = 3, S = 3: (7 − 3)/3 + 1 = 2.33… → fractions are illegal
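The rule can be wrapped in a tiny helper (a sketch of mine, not lecture code):

```python
def conv_output_size(n, f, s):
    """Spatial output size without padding: (N - F) / S + 1."""
    assert (n - f) % s == 0, "fractions are illegal: the stride does not tile the input"
    return (n - f) // s + 1

print(conv_output_size(7, 3, 1))  # 5
print(conv_output_size(7, 3, 2))  # 3
# conv_output_size(7, 3, 3) raises: (7 - 3) / 3 + 1 = 2.33 is not an integer
```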
Convolution Layers: Padding
Without padding:
• Sizes get small too quickly
• A corner pixel is only used once

Remedy: pad the input image with a border of values; most common is 'zero' padding.
Example: 7 × 7 image, padding P = 1, stride S = 1, filter F = 3 → output 7 × 7.

Output size: (⌊(N + 2 ⋅ P − F)/S⌋ + 1) × (⌊(N + 2 ⋅ P − F)/S⌋ + 1)
where ⌊⋅⌋ denotes the floor operator (in practice an integer division is performed).
Types of convolutions:
• Valid convolution: using no padding.
• Same convolution: output size = input size; set the padding to P = (F − 1)/2.
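A short sketch (mine) of the padded output-size formula, showing the 'valid' and 'same' cases:

```python
def output_size(n, f, s, p):
    """Spatial output size with padding: floor((N + 2P - F) / S) + 1."""
    return (n + 2 * p - f) // s + 1

n, f = 7, 3
print(output_size(n, f, s=1, p=0))             # 5: 'valid' convolution (no padding)
print(output_size(n, f, s=1, p=(f - 1) // 2))  # 7: 'same' convolution keeps the input size
```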
Pooling
Input 4 × 4:
3 1 3 5
6 0 7 9
3 2 1 4
0 2 4 3

Max pool with 2 × 2 filters and stride 2 → 'pooled' output:
6 9
3 4

Average pool with 2 × 2 filters and stride 2 → 'pooled' output:
2.5  6
1.75 3
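Both pooling operations can be reproduced with PyTorch's pooling layers (a sketch assuming the (batch, channels, height, width) layout):

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[3., 1., 3., 5.],
                    [6., 0., 7., 9.],
                    [3., 2., 1., 4.],
                    [0., 2., 4., 3.]]]])   # shape (1, 1, 4, 4)

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

print(max_pool(x))  # [[6., 9.], [3., 4.]]
print(avg_pool(x))  # [[2.5, 6.], [1.75, 3.]]
```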
[Figure: a 5 × 5 input convolved with a 3 × 3 filter gives a 3 × 3 output]
• http://cs231n.github.io/convolutional-networks/