CHAPTER 8
DEEP LEARNING
AND MOBILE ROBOTS
Today: Convolutional Neural Networks (CNNs)
1. Scene understanding and object recognition for machines (and humans)
– Scene/object recognition challenge. Illusions reveal primitives, conflicting info
– Human neurons/circuits. Visual cortex layers==abstraction. General cognition
2. Classical machine vision foundations: features, scenes, filters, convolution
– Spatial structure primitives: edge detectors & other filters, feature recognition
– Convolution: basics, padding, stride, object recognition, architectures
3. CNN foundations: LeNet, de novo feature learning, parameter sharing
– Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning
– CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected
4. Modern CNN architectures: millions of parameters, dozens of layers
– Feature invariance is hard: apply perturbations, learn for each variation
– ImageNet progression of best performers
– AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU
– VGGNet: simpler but deeper (16–19 layers), 140M parameters, ensembles
– GoogleNet: new primitive=inception module, 5M params, no FC, efficiency
– ResNet: 152 layers, vanishing gradients fit residuals to enable learning
5. Countless applications: General architecture, enormous power
– Semantic segmentation, facial detection/recognition, self-driving, image
colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
3
1a. What do you see, and how?
Can we teach machines to see?
4
What do you see?
5
How do you see?
9
Inspiration: human/animal visual cortex
• Chemical accumulation across dendritic connections: pre-synaptic axon -> post-synaptic dendrite -> neuronal cell body
• Each neuron receives multiple signals from its many dendrites; when a threshold is crossed, it fires, and its axon then sends an outgoing signal to downstream neurons
• Weak stimuli are ignored; sufficiently strong ones cross the activation threshold: a non-linearity within each neuronal level
• Neurons connected into circuits (neural networks): emergent properties, learning, memory
• Simple primitives arranged in simple, repetitive, and extremely large networks
• 86 billion neurons, each connecting to ~10k others: about 1 quadrillion (10^15) connections
Abstraction layers: edges, bars, directions, shapes, objects, scenes
[Figure: visual cortex hierarchy; V1 encodes orientation, disparity, and some color; "hardware expansion" of cortical areas from chimp to human]
Using Spatial Structure
Feature Extraction with Convolution
- Filter of size 4x4: 16 different weights
- Apply this same filter to 4x4 patches in the input
- Shift by 2 pixels for the next patch
Fully Connected Neural Network
2b. Convolutions and filters
22
The convolution operation is an element-wise multiply-and-add between a filter (kernel) and each patch of the input.
CNNs for Classification
1. Convolution: apply learned filters to generate feature maps.
2. Non-linearity: often ReLU.
3. Pooling: downsampling operation on each feature map.
Train the model with image data; learn the weights of the filters in the convolutional layers.
tf.keras.layers.Conv2D
The same pattern appears in different places, so the detectors can be compressed: instead of training a separate "upper-left beak" detector and "middle beak" detector, train one small detector and let it "move around" the image.
A convolutional layer
A CNN is a neural network with some convolutional layers (and some other layers). A convolutional layer has a number of filters that each perform the convolution operation; a filter can act, for example, as a "beak detector".
26
Convolution
These are the network parameters to be learned.

6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1:          Filter 2:
 1 -1 -1           -1  1 -1
-1  1 -1           -1  1 -1
-1 -1  1           -1  1 -1

Each filter detects a small pattern (3 x 3).
Convolution (Filter 1, stride = 1)
Slide the 3 x 3 filter over the 6 x 6 image and take the dot product at each position:
- top-left 3 x 3 patch: dot product = 3
- shift right by 1 pixel: dot product = -1
Convolution (Filter 1, stride = 2)
With stride = 2 the filter moves 2 pixels at a time, so the first row of outputs is 3, -3.
Convolution (Filter 1, stride = 1)
Sliding Filter 1 over the whole 6 x 6 image gives a 4 x 4 feature map:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
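A minimal NumPy sketch that reproduces the 4 x 4 feature map above by sliding the 3 x 3 filter over the 6 x 6 image with stride 1:

import numpy as np

image = np.array([
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
])

filter1 = np.array([
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
])

def convolve2d(img, kernel, stride=1):
    """Valid (no padding) convolution: element-wise multiply and add per patch."""
    h = (img.shape[0] - kernel.shape[0]) // stride + 1
    w = (img.shape[1] - kernel.shape[1]) // stride + 1
    out = np.zeros((h, w), dtype=img.dtype)
    for i in range(h):
        for j in range(w):
            patch = img[i*stride:i*stride+kernel.shape[0],
                        j*stride:j*stride+kernel.shape[1]]
            out[i, j] = np.sum(patch * kernel)
    return out

print(convolve2d(image, filter1, stride=1))
# [[ 3 -1 -3 -1]
#  [-3  1  0 -3]
#  [-3 -3  0  1]
#  [ 3 -2 -2 -1]]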
Convolution (Filter 2, stride = 1)
Repeat this for each filter. Filter 2 gives its own 4 x 4 feature map:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3

The two 4 x 4 feature maps together form a 2 x 4 x 4 output volume.
Color image: RGB, 3 channels. Each filter then has one 3 x 3 slice per channel (3 x 3 x 3) and is convolved across all channels at once.
Convolution vs. fully-connected
If the 6 x 6 image is flattened into a vector x1 ... x36 and fed to a fully-connected layer, every hidden unit connects to all 36 inputs. Convolution performs the same multiply-and-add, but each output connects only to a small patch of the input.
Fewer parameters!
Treating the convolution as a sparse layer over the flattened 6 x 6 image (pixels numbered 1-36): the first output value, 3, connects only to the 9 inputs covered by the 3 x 3 filter (pixels 1, 2, 3, 7, 8, 9, 13, 14, 15), not to all 36 inputs as in a fully-connected layer.
Shared weights: even fewer parameters
The next output value, -1, is computed from the next patch of pixels using exactly the same 9 weights. Because every patch shares the same filter weights, the layer has far fewer parameters than a fully-connected one.
The whole CNN
image -> Convolution -> Max Pooling -> Convolution -> Max Pooling (can repeat many times) -> Flattened -> Fully Connected feedforward network -> cat, dog, ...
Max Pooling
The two 4 x 4 feature maps produced by Filter 1 and Filter 2:
Filter 1:            Filter 2:
 3 -1 -3 -1          -1 -1 -1 -1
-3  1  0 -3          -1 -1 -2  1
-3 -3  0  1          -1 -1 -2  1
 3 -2 -2 -1          -1  0 -4  3
Why Pooling
Subsampling: the 6 x 6 image is convolved into a 4 x 4 feature map, then 2 x 2 max pooling keeps only the largest value in each 2 x 2 block, giving a new but smaller 2 x 2 image per filter:
Filter 1 map pooled:   Filter 2 map pooled:
3 0                    -1 1
3 1                     0 3
Each filter is a channel.
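A minimal NumPy sketch of 2 x 2 max pooling applied to the Filter 1 feature map from the previous slides:

import numpy as np

# The 4 x 4 feature map produced by Filter 1 (copied from the slide above).
fmap1 = np.array([
    [ 3, -1, -3, -1],
    [-3,  1,  0, -3],
    [-3, -3,  0,  1],
    [ 3, -2, -2, -1],
])

def max_pool2d(fmap, size=2, stride=2):
    """Keep only the maximum value of each size x size block."""
    h = (fmap.shape[0] - size) // stride + 1
    w = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((h, w), dtype=fmap.dtype)
    for i in range(h):
        for j in range(w):
            out[i, j] = fmap[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

print(max_pool2d(fmap1))
# [[3 0]
#  [3 1]]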
The whole CNN
After convolution and max pooling, each filter's pooled map is a new image, smaller than the original; the number of channels equals the number of filters. Convolution + Max Pooling can be repeated many times. The final pooled maps (here the 2 x 2 maps [3 0; 3 1] and [-1 1; 0 3]) are flattened into a vector and fed to a fully-connected feedforward network that outputs the class (cat, dog, ...).
CNN in Keras
Only the network structure and the input format change (vector -> 3-D tensor); Input_shape = (28, 28, 1).

Input: 1 x 28 x 28
Convolution, 25 filters of 3 x 3     -> 25 x 26 x 26   (9 parameters per filter)
Max Pooling, 2 x 2                   -> 25 x 13 x 13
Convolution, 50 filters of 3 x 3     -> 50 x 11 x 11   (225 = 25 x 9 parameters per filter)
Max Pooling, 2 x 2                   -> 50 x 5 x 5
Flattened                            -> 1250
Fully connected feedforward network  -> output

Reminder: Output size = (N+2P-F)/stride + 1
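A minimal tf.keras sketch of this structure; the layer sizes follow the slide, while the final dense layers and activations are assumptions:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(25, kernel_size=(3, 3), activation='relu'),  # -> 26 x 26 x 25
    tf.keras.layers.MaxPool2D(pool_size=(2, 2)),                        # -> 13 x 13 x 25
    tf.keras.layers.Conv2D(50, kernel_size=(3, 3), activation='relu'),  # -> 11 x 11 x 50
    tf.keras.layers.MaxPool2D(pool_size=(2, 2)),                        # -> 5 x 5 x 50
    tf.keras.layers.Flatten(),                                          # -> 1250
    tf.keras.layers.Dense(100, activation='relu'),    # assumed hidden size
    tf.keras.layers.Dense(10, activation='softmax'),  # assumed 10 classes (e.g. MNIST digits)
])
model.summary()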
AlphaGo*
A neural network maps the current board to the next move (19 x 19 positions).
The board is encoded as a 19 x 19 matrix: black = 1, white = -1, none = 0.
A fully-connected feedforward network could be used, but a CNN performs much better.
AlphaGo is a computer program able to play the game of Go, developed by the British company DeepMind, which was acquired by Google in 2014.
AlphaGo's policy network

[Figure: spectrogram (image over time); CNNs can also be applied to speech]

CNN in text classification
Source of image: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.703.6858&rep=rep1&type=pdf
CNN: An Example (Animation)
Padding adds additional rows and columns of pixels around the edges of the input data so that the output feature map can keep the same size as the input.
Stride is how far the filter moves in every step along one direction.
Animated examples: no padding with stride 2, and padding 1 with stride 2.
Reminder: Output size = (N+2P-F)/stride + 1
*source: https://github.com/vdumoulin/conv_arithmetic
CNN: An Example
• Input volume: 32×32×3
• 10 5×5 filters with stride 1, pad 2
Output volume size = ?
Reminder: Output size = (N+2P-F)/stride + 1
*reference: http://cs231n.stanford.edu/2017/
CNN: An Example
• Input volume: 32×32×3
• 10 5×5 filters with stride 1, pad 2
Output volume size:
• (32 + 2×2 - 5)/1 + 1 = 32 spatially
• => 32×32×10
Reminder: Output size = (N+2P-F)/stride + 1
*reference: http://cs231n.stanford.edu/2017/
CNN: An Example
• Input volume: 32×32×3
• 10 5×5 filters with stride 1, pad 2
Number of parameters in this layer?
Reminder: Output size = (N+2P-F)/stride + 1
*reference: http://cs231n.stanford.edu/2017/
CNN: An Example
• Input volume: 32×32×3
• 10 5×5 filters with stride 1, pad 2
Number of parameters in this layer?
• Each filter has 5×5×3 + 1 = 76 params (+1 for bias)
• => 76×10 = 760
Number of parameters = (filter width × filter height × number of channels in the previous layer + 1) × number of filters
Reminder: Output size = (N+2P-F)/stride + 1
*reference: http://cs231n.stanford.edu/2017/
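A small helper to check these numbers (plain Python; function names are mine):

def conv_output_size(n, f, p, stride):
    """Output size = (N + 2P - F) / stride + 1 along one spatial dimension."""
    return (n + 2 * p - f) // stride + 1

def conv_param_count(f, in_channels, num_filters):
    """(filter width * filter height * input channels + 1 bias) * number of filters."""
    return (f * f * in_channels + 1) * num_filters

print(conv_output_size(32, 5, 2, 1))   # 32 -> output volume 32 x 32 x 10
print(conv_param_count(5, 3, 10))      # 760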
CNN: Convolution
• A ConvNet is a sequence of convolutional layers, each followed by a non-linearity
• Example: a 32×32×3 image passed through successive 5×5 convolutions (no padding) shrinks spatially: 32 -> 28 -> 24 -> 20 ...
• Choices of non-linearity:
  • Tanh / Sigmoid
  • ReLU [Nair et al., 2010]
  • Leaky ReLU [Maas et al., 2013]
Reminder: Output size = (N+2P-F)/stride + 1
*reference: http://cs231n.stanford.edu/2017/
*Image source: https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
CNN: Pooling
• Pooling layer
  • Makes the representations smaller and more manageable
  • Operates over each activation map independently
  • Enhances translation invariance (invariance to small transformations)
  • Larger receptive fields (see more of the input)
  • Regularization effect
Example: a 224×224×64 input is downsampled by pooling to 112×112×64 (spatial size halves, depth unchanged).
*reference: http://cs231n.stanford.edu/2017/
CNN: Pooling
• Max pooling and average pooling
• With 2×2 filters and stride 2
ROI pooling
*sources:
https://deepsense.ai/region-of-interest-pooling-explained/
http://mlss.tuebingen.mpg.de/2015/slides/fergus/Fergus_1.pdf
https://vaaaaaanquish.hatenablog.com/entry/2015/01/26/060622
CNN: Visualization
• Visualization of CNN feature representations [Zeiler et al., 2014]
• VGG-16 [Simonyan et al., 2015]
*reference: http://cs231n.stanford.edu/2017/
CNN in Computer Vision: Everywhere
*reference: http://cs231n.stanford.edu/2017/
CNN in Computer Vision: Everywhere
Self-driving cars; human pose estimation [Cao et al., 2017]
*reference: http://cs231n.stanford.edu/2017/
Automated Vehicles/Robots and Transportation
Producing Feature Maps
64
A simple pattern: Edges
How can we detect edges with a kernel?
[Figure: input image, edge-detection filter, output feature map (Goodfellow 2016)]
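A tiny NumPy sketch of the idea; the image and the [1, -1] difference kernel are made-up examples, not the exact filter from the slide:

import numpy as np

img = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])
kernel = np.array([[1, -1]])  # horizontal difference: responds to vertical edges

out = np.zeros((4, 3), dtype=int)
for i in range(4):
    for j in range(3):
        out[i, j] = np.sum(img[i:i+1, j:j+2] * kernel)
print(out)  # non-zero only in the column where pixel intensity changes (the edge)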
Simple Kernels* / Filters
* kernel: the central or most important part of something, a central or basic part
X or X?
(Rohrer, How do CNNs work?)
There are three approaches to edge cases in convolution.

Zero Padding Controls Output Size (Goodfellow 2016)
• Same convolution: zero-pad the input so the output has the same size as the input
• Valid-only convolution: produce an output only where the entire kernel is contained in the input (shrinks the output)
• Full convolution: zero-pad the input so an output is produced wherever it contains at least one input value (expands the output)

x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
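A short check of how the padding argument changes the output shape (TensorFlow, random data):

import tensorflow as tf

x = tf.random.normal([1, 6, 6, 1])   # batch of one 6 x 6 single-channel image
w = tf.random.normal([3, 3, 1, 1])   # one 3 x 3 filter

same  = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')
valid = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='VALID')
print(same.shape, valid.shape)       # (1, 6, 6, 1) and (1, 4, 4, 1)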
Key idea:
learn hierarchy of features
directly from the data
(rather than hand-engineering them)
72
Lee+ ICML 2009
Key idea: re-use parameters
Convolution shares parameters
Example 3x3 convolution on a 5x5 image
73
Feature Extraction with Convolution
75
[LeCun et al., 1998]
LeNet-5
conv 5x5 (s=1) -> avg pool (f=2, s=2) -> conv 5x5 (s=1) -> avg pool (f=2, s=2) -> FC -> FC -> output
32 x 32 x 1 -> 28 x 28 x 6 -> 14 x 14 x 6 -> 10 x 10 x 16 -> 5 x 5 x 16 -> 120 -> 84 -> 10
Reminder: Output size = (N+2P-F)/stride + 1
This slide is taken from Andrew Ng [LeCun et al., 1998]
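A minimal tf.keras sketch of LeNet-5 with these dimensions (the tanh/softmax activations are simplifying assumptions; the original paper used different non-linearities and RBF outputs):

import tensorflow as tf

lenet5 = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 1)),
    tf.keras.layers.Conv2D(6, kernel_size=5, strides=1, activation='tanh'),   # -> 28 x 28 x 6
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2),                 # -> 14 x 14 x 6
    tf.keras.layers.Conv2D(16, kernel_size=5, strides=1, activation='tanh'),  # -> 10 x 10 x 16
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2),                 # ->  5 x  5 x 16
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation='tanh'),
    tf.keras.layers.Dense(84, activation='tanh'),
    tf.keras.layers.Dense(10, activation='softmax'),
])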
LeNet-5
77
[LeCun et al., 1998]
Backpropagation of convolution
78
Slide taken from Forward And Backpropagation in Convolutional Neural Network. - Medium
3b. Convolutional Neural
Networks (CNNs)
79
An image classification CNN
80
Representation Learning in Deep CNNs
81
Lee+ ICML 2009
CNNs for Classification
83
Convolutional Layers: Local Connectivity
tf.keras.layers.Conv2D
For a neuron in a hidden layer:
- take inputs only from a patch of the previous layer
- compute a weighted sum
- apply a bias
Convolutional Layers: Local Connectivity
tf.keras.layers.Conv2D
For a 4x4 filter (a matrix of weights wij) and neuron (p, q) in the hidden layer:
1) apply a window of weights
2) compute linear combinations
3) activate with a non-linear function
CNNs: Spatial Arrangement of Output Volume
Layer dimensions: h x w x d, where h (height) and w (width) are spatial dimensions and d (depth) is the number of filters.
Stride: filter step size.
Receptive field: locations in the input image that a node is connected to.
tf.keras.layers.Conv2D( filters=d, kernel_size=(h,w), strides=s )
Introducing Non-Linearity: the Rectified Linear Unit (ReLU)
- Apply after every convolution operation (i.e., after convolutional layers)
- ReLU: pixel-by-pixel operation that replaces all negative values by zero
- Non-linear operation
tf.keras.layers.ReLU
(Karn, Intuitive CNNs)
Pooling
tf.keras.layers.MaxPool2D( pool_size=(2,2), strides=2 )
1) Reduced dimensionality
2) Spatial invariance

(On ReLU, f(x) = max(0, x): when will we backpropagate through this unit? Once it "dies", i.e. always outputs zero, what happens to it?)

Pooling reduces dimensionality by giving up spatial location:
• max pooling reports the maximum output within a defined neighborhood
• padding can be SAME or VALID
Dilated Convolution
91
CNNs for Classification: Feature Learning
def generate_model():
    model = tf.keras.Sequential([
        # first convolutional layer: 32 learned 3x3 filters, followed by 2x2 max pooling
        tf.keras.layers.Conv2D(32, kernel_size=3, activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
    ])
    return model
Today: Convolutional Neural Networks (CNNs)
1. Scene understanding and object recognition for machines (and humans)
– Scene/object recognition challenge. Illusions reveal primitives, conflicting info
– Human neurons/circuits. Visual cortex layers==abstraction. General cognition
2. Classical machine vision foundations: features, scenes, filters, convolution
– Spatial structure primitives: edge detectors & other filters, feature recognition
– Convolution: basics, padding, stride, object recognition, architectures
3. CNN foundations: LeNet, de novo feature learning, parameter sharing
– Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning
– CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected
4. Modern CNN architectures: millions of parameters, dozens of layers
– Feature invariance is hard: apply perturbations, learn for each variation
– ImageNet progression of best performers
– AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU
– VGGNet: simpler but deeper (16–19 layers), 140M parameters, ensembles
– GoogleNet: new primitive=inception module, 5M params, no FC, efficiency
– ResNet: 152 layers, vanishing gradients fit residuals to enable learning
5. Countless applications: General architecture, enormous power
– Semantic segmentation, facial detection/recognition, self-driving, image
colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
95
4a. Real-world feature invariance is
hard
96
How can computers recognize objects?
97
How can computers recognize objects?
Challenge:
•Objects can be anywhere in the scene, in any orientation, rotation, color hue, etc.
•How can we overcome this challenge?
Answer:
• Learn a ton of features (millions) from the bottom up
• Learn the convolutional filters, rather than pre-computing them 98
Feature invariance to perturbation is hard
99
Li/Johnson/Yeung CS231n
Next-generation models
explode # of parameters
100
LeNet-5
• Gradient Based Learning Applied To Document Recognition -
Y. Lecun, L. Bottou, Y. Bengio, P. Haffner; 1998
• Helped establish how we use CNNs today
• Replaced manual feature extraction
101
[LeCun et al., 1998]
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
102
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
AlexNet
103
[Krizhevsky et al., 2012]
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
• Teams from across the world compete to see who has the
best computer vision model for tasks such as classification,
localization, detection, and more.
105
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Architecture AlexNet
Layer stack: CONV1 - MAX POOL1 - NORM1 - CONV2 - MAX POOL2 - NORM2 - CONV3 - CONV4 - CONV5 - MAX POOL3 - FC6 - FC7 - FC8
• Input: 227x227x3 images (224x224 before padding)
• First layer (CONV1): 96 11x11 filters applied at stride 4
• Output volume size? (N-F)/s + 1 = (227-11)/4 + 1 = 55 -> [55x55x96]
• Number of parameters in this layer? (11*11*3)*96 = 35K
Reminder: Output size = (N+2P-F)/stride + 1
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Krizhevsky et al., 2012]
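A quick self-contained check of these CONV1 numbers (plain Python; variable names are mine):

n, f, stride, num_filters, channels = 227, 11, 4, 96, 3
out = (n - f) // stride + 1                 # no padding
params = (f * f * channels) * num_filters   # biases ignored, as on the slide
print(out, params)                          # 55, 34848 (~35K)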
AlexNet
107
[Krizhevsky et al., 2012]
Architecture AlexNet
• Input: 227x227x3 images (224x224 before padding)
• After CONV1: 55x55x96
• Second layer (MAX POOL1): 3x3 filters applied at stride 2
• Output volume size? (N-F)/s + 1 = (55-3)/2 + 1 = 27 -> [27x27x96]
• Number of parameters in this layer? 0! (pooling layers have no weights)
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Krizhevsky et al., 2012]
AlexNet
conv 11x11 (s=4, P=0): 227x227x3 -> 55x55x96
max pool 3x3 (s=2): -> 27x27x96
conv 5x5 (s=1, P=2): -> 27x27x256
max pool 3x3 (s=2): -> ...
... -> FC 4096 -> FC 4096 -> Softmax 1000
This slide is taken from Andrew Ng [Krizhevsky et al., 2012]
AlexNet
Details/Retrospectives:
• first use of ReLU
• used Norm layers (not common anymore)
• heavy data augmentation
• dropout 0.5
• batch size 128
• 7 CNN ensemble
111
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Krizhevsky et al., 2012]
AlexNet
• Trained on GTX 580 GPU with only 3 GB of memory.
112
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Krizhevsky et al., 2012]
AlexNet
AlexNet was the coming out party for CNNs in the computer
vision community. This was the first time a model performed
so well on a historically difficult ImageNet dataset. This
paper illustrated the benefits of CNNs and backed them up
with record breaking performance in the competition.
113
[Krizhevsky et al., 2012]
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
114
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
115
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
VGGNet
116
[Simonyan and Zisserman, 2014]
VGGNet
Architecture: Input -> [3x3 conv, 64] x2 -> Pool /2 -> [3x3 conv, 128] x2 -> Pool /2 -> [3x3 conv, 256] x3 -> Pool /2 -> [3x3 conv, 512] x3 -> Pool /2 -> [3x3 conv, 512] x3 -> Pool /2 -> FC 4096 -> FC 4096 -> FC 1000 -> Softmax

• Smaller filters: only 3x3 CONV filters, stride 1, pad 1, and 2x2 MAX POOL, stride 2
• Deeper network: AlexNet had 8 layers; VGGNet has 16-19 layers
• ZFNet: 11.7% top-5 error in ILSVRC'13
• VGGNet: 7.3% top-5 error in ILSVRC'14
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014]
VGGNet
• Why use smaller filters? (3x3 conv)
Stack of three 3x3 conv (stride 1) layers has the same effective
receptive field as one 7x7 conv layer.
118
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014]
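A quick check of the usual parameter comparison behind this choice (plain Python; C is an assumed channel count and biases are ignored):

C = 64                                    # assumed number of input and output channels
three_3x3 = 3 * (3 * 3 * C * C)           # three stacked 3x3 conv layers
one_7x7   = 7 * 7 * C * C                 # a single 7x7 conv layer
print(three_3x3, one_7x7)                 # 110592 vs 200704: same receptive field, fewer params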
VGGNet
VGG16 (Input -> [3x3 conv, 64] x2 -> Pool -> [3x3 conv, 128] x2 -> Pool -> [3x3 conv, 256] x3 -> Pool -> [3x3 conv, 512] x3 -> Pool -> [3x3 conv, 512] x3 -> Pool -> FC 4096 -> FC 4096 -> FC 1000 -> Softmax):
TOTAL memory: ~24M activations * 4 bytes ≈ 96 MB per image (forward pass)
TOTAL params: 138M parameters
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014]
Input memory: 224*224*3=150K params: 0
3x3 conv, 64 memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
3x3 conv, 64 memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
Pool memory: 112*112*64=800K params: 0
3x3 conv, 128 memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
121
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014]
VGGNet
122
[Simonyan and Zisserman, 2014]
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
123
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
GoogleNet
124
[Szegedy et al., 2014]
GoogleNet
• 22 layers
• Efficient “Inception” module - strayed from
the general approach of simply stacking conv
and pooling layers on top of each other in a
sequential structure
• No FC layers
• Only 5 million parameters!
• ILSVRC’14 classification winner (6.7% top 5
error)
125
[Szegedy et al., 2014]
GoogleNet
“Inception module”: design a good local network topology (network within
a network) and then stack these modules on top of each other
Filter
concatenation
1x1 3x3 5x5 1x1
convolution convolution convolution convolution
Previous layer
126
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014]
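A hedged tf.keras sketch of an inception-style block in this spirit (branch widths are made up; the real GoogLeNet modules also use 1x1 convolutions for dimension reduction before the larger filters):

import tensorflow as tf

def naive_inception_block(x, f1=64, f3=128, f5=32, fp=32):
    """Parallel 1x1 / 3x3 / 5x5 convolutions plus pooling, concatenated along channels."""
    b1 = tf.keras.layers.Conv2D(f1, 1, padding='same', activation='relu')(x)
    b3 = tf.keras.layers.Conv2D(f3, 3, padding='same', activation='relu')(x)
    b5 = tf.keras.layers.Conv2D(f5, 5, padding='same', activation='relu')(x)
    bp = tf.keras.layers.MaxPool2D(3, strides=1, padding='same')(x)
    bp = tf.keras.layers.Conv2D(fp, 1, padding='same', activation='relu')(bp)
    return tf.keras.layers.Concatenate()([b1, b3, b5, bp])  # filter concatenation

inputs = tf.keras.Input(shape=(28, 28, 192))
outputs = naive_inception_block(inputs)
model = tf.keras.Model(inputs, outputs)   # output: 28 x 28 x (64+128+32+32)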
GoogleNet
Details/Retrospectives :
• Deeper networks, with computational efficiency
• 22 layers
• Efficient “Inception” module
• No FC layers
• 12x less params than AlexNet
• ILSVRC’14 classification winner (6.7% top 5 error)
127
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014]
GoogleNet
128
[Szegedy et al., 2014]
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
129
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ResNet
131
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
• What happens when we continue stacking deeper layers on a
convolutional neural network?
133
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
Residual Block
Input x goes through a conv-relu-conv series and gives us F(x). That result is then added to the original input x; call that H(x) = F(x) + x.
In traditional CNNs, H(x) would just be equal to F(x). So instead of computing the whole transformation from x to H(x) directly, we compute only the term F(x) that has to be added to the input x (the residual).
[He et al., 2015]
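A minimal tf.keras sketch of such a residual block (filter count and kernel size are assumptions; real ResNet blocks also use batch normalization):

import tensorflow as tf

def residual_block(x, filters=64):
    """F(x) = conv-relu-conv; the block outputs relu(F(x) + x)."""
    fx = tf.keras.layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    fx = tf.keras.layers.Conv2D(filters, 3, padding='same')(fx)
    hx = tf.keras.layers.Add()([fx, x])        # H(x) = F(x) + x (skip connection)
    return tf.keras.layers.ReLU()(hx)

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)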
ResNet
136
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
• Total depths of 34, 50, 101, or 152 layers for
ImageNet
• For deeper networks (ResNet-50+), use
“bottleneck” layer to improve efficiency
(similar to GoogLeNet)
137
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
Experimental Results:
• Able to train very deep networks without degrading
• Deeper networks now achieve lower training errors as
expected
138
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
139
[He et al., 2015]
Accuracy comparison
140
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Forward pass time and power
consumption
141
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
142
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Today: Convolutional Neural Networks (CNNs)
1. Scene understanding and object recognition for machines (and humans)
– Scene/object recognition challenge. Illusions reveal primitives, conflicting info
– Human neurons/circuits. Visual cortex layers==abstraction. General cognition
2. Classical machine vision foundations: features, scenes, filters, convolution
– Spatial structure primitives: edge detectors & other filters, feature recognition
– Convolution: basics, padding, stride, object recognition, architectures
3. CNN foundations: LeNet, de novo feature learning, parameter sharing
– Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning
– CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected
4. Modern CNN architectures: millions of parameters, dozens of layers
– Feature invariance is hard: apply perturbations, learn for each variation
– ImageNet progression of best performers
– AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU
– VGGNet: simpler but deeper (16–19 layers), 140M parameters, ensembles
– GoogleNet: new primitive=inception module, 5M params, no FC, efficiency
– ResNet: 152 layers, vanishing gradients fit residuals to enable learning
5. Countless applications: General architecture, enormous power
– Semantic segmentation, facial detection/recognition, self-driving, image
colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
143
Countless applications
144
An Architecture for Many Applications
Detection
Semantic segmentation
End-to-end robotic control
145
Semantic Segmentation: Fully Convolutional Networks
tf.keras.layers.Conv2DTranspose
146
Long+ CVPR 2015
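A tiny sketch of the transposed-convolution upsampling used in fully convolutional segmentation networks (the feature-map size and the 21 output classes are assumptions):

import tensorflow as tf

x = tf.random.normal([1, 14, 14, 64])                  # a low-resolution feature map
up = tf.keras.layers.Conv2DTranspose(
    filters=21, kernel_size=3, strides=2, padding='same')(x)
print(up.shape)                                        # (1, 28, 28, 21): upsampled per-class score map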
Facial Detection & Recognition
147
Self-Driving Cars
148
Amini+ ICRA 2019.
Self-Driving Cars: Navigation from Visual Perception
(ex.camera)
Coarse
Maps
(ex.GPS)
149
Amini+ ICRA 2019
End-to-End Framework for Autonomous Navigation
151
Optimizing Images
152
Up-scaling low-resolution images
153
Medicine, Biology, Healthcare
154
Gulshan+ JAMA 2016.
Breast Cancer Screening
AI AI
MD MD
Readers Readers
Brain Tumors
Dong+ MIUA 2017; Soleimany+ arXiv 2019
DeepBind
157
[Alipanahi et al., 2015]
Predicting disease mutations
158
[Alipanahi et al., 2015]
CNN AND SELF DRIVING
ROBOTIC CARS
159
Driverless Cars: Implications for
Travel Behavior - #AutoBhatSX
160
Outline
• Motivation
• Automated vehicle technology
• Activity-travel behavior considerations
• Infrastructure planning & modeling implications
• Conclusions
161
The Context
Automated Vehicles: Vehicles that are able to guide themselves
from an origin point to a destination point desired by the
individual
162
Motivation
163
McKinsey: Autonomous Cars One of 12 Major Technology Disruptors
164
Automated Vehicles and Transportation
Technology
Traveler
Infrastructure
Behavior
165
Automated Vehicle
Technology
166
Two Types of Technology
167
Autonomous (Self-driving) Vehicle
168
Autonomous (Self-driving) Vehicle
169
Connected Vehicle Research
• Addresses suite of
technology and applications
using wireless
communications to provide
connectivity
• Among vehicle types
• Variety of roadway
infrastructure
170
Connected Vehicle Research
171
A “Connected” Vehicle
Data sent from the vehicle: real-time location, speed, acceleration, emissions, fuel consumption, and vehicle diagnostics data.
Data provided to the vehicle: real-time traffic information, safety messages, traffic signal messages, eco-speed limits, eco-routes, parking information, etc.
Improved powertrain: more fuel-efficient powertrains including hybrids, electric vehicles, and other alternative power sources.
Levels of Vehicle Automation
Level 0: No automation
173
Government Recognition
174
Infrastructure Needs/Planning Driven By…
175
Smarter Infrastructure
176
Technology and Infrastructure Combination Leads To…
• Safety enhancement
• Virtual elimination of driver error – factor in 80% of crashes
• Enhanced vehicle control, positioning, spacing, speed,
harmonization
• No drowsy, impaired, stressed, or aggressive drivers
• Reduced incidents and network disruptions
• Offsetting behavior on part of driver
177
Technology and Infrastructure Combination Leads To…
• Capacity enhancement
• Platooning reduces headways and improves flow at transitions
• Vehicle positioning (lateral control) allows reduced lane widths and
utilization of shoulders; accurate mapping critical
• Optimized route choice
178
RoboCup (1997~)
179
6
Home Robots
PR2 Fetches Beverage (Willow Garage); PR2 Making Popcorn (TU Munich)
180
7
Human-Like Robots
181
8
Robot Life in a City
https://www.youtube.com/watch?v=gPzC88HkgcU&t=80s
"Cognitive" Smart Machines = Body (HW, device) + Mind (SW, data)
Enabling Technologies for AI Robots
• Perception
▪ Object recognition
▪ Person tracking
• Control
▪ Manipulation
▪ Action control
• Navigation
▪ Obstacle avoidance
▪ Map building & localization
• Interaction
▪ Vision and voice
▪ Multimodal interaction
• Computing Power
▪ Cloud computing
▪ GPUs, parallel computing
▪ Neural processors
184
11
2. Deep Learning for AI Robots
185
Traditional Machine Learning
vs. Deep Learning
186
13
Deep Learning Revolution
187
Power of Deep Learning
AlphaGo
2016
189
16
Deep Learning for Voice and Dialogue
• Speech LSTM-RNN (Recurrent Neural Networks)
• End-to-End Memory Networks (N2N MemNet)
• CNN + RNN for Question Answering
Sukhbaatar, Sainbayar, Jason Weston, and Rob Gao, Haoyuan, et al. "Are You Talking to a Machine? Dataset and
Fergus. "End-to-end memory networks." Advances in Methods for Multilingual Image Question." Advances in Neural
Neural Information Processing Systems. 2015. Information Processing Systems. 2015.
190
17
Interaction: Conversational Interface
191
18
Deep Learning for Robotic Grasping
192
(C) 2015-2016, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 19
Deep Reinforcement Learning
for Action Control
193
20
Deep Learning for Perception
• ImageNet Large-Scale Visual Recognition Challenge
  ▪ Image classification/localization
  ▪ 1.2M labeled images, 1000 classes
  ▪ Convolutional Neural Networks (CNNs) have been dominating the contest since 2012:
    • 2012 non-CNN: 26.2% (top-5 error)
    • 2012: (Hinton, AlexNet) 15.3% (using GPUs)
    • 2013: (Clarifai) 11.2%
    • 2014: (Google, GoogLeNet) 6.7%
    • pre-2015: (Google) 4.9%, beyond human-level performance
194
21
Deep Learning for Video Analysis
• Use 3D CNNs to model the temporal patterns as well as
the spatial patterns
[Figure: 3D CNN architecture for human action recognition, consisting of 1 hardwired layer, 3 convolution layers, 2 subsampling layers, and 1 full connection layer. S. Ji, K. Yu, et al., PAMI, 2013]
Deep Learning for Autonomous Driving
(NVIDIA)
196
23
3. New AI
197
Human Intelligence and Artificial Intelligence
198
Autonomous Machine Learning
1G: Supervised Learning 2G: Unsupervised Learning 3G: Autonomous Learning
(1980s~2000) (2000~Present) (Next Generation)
199
ⓒ 2005-2015 SNU Biointelligence Laboratory, http://bi.snu.ac.kr/ 35
Future of AI
Progression enabled by technology and parallel computing: AI with deep learning -> (embodied, brain-like) cognitive AI -> human-level AI with autonomous agency (works out its own goals) -> superhuman AI with free will.
Google Self Driving Car
CONTENTS
Introduction
Technology
What Is It ?
How Does It Work ?
Equipment Used
202
Google Self Driving Car
Advantages
limitations
References
203
Google Self Driving Car
Introduction
The Google self-driving car is a project by Google that involves developing technology for mainly electric cars.
The software installed in Google's cars is called Google Chauffeur.
The project was formerly led by Sebastian Thrun, former director of the Stanford Artificial Intelligence Laboratory and co-inventor of Google Street View.
Google plans to make these cars available to the public in 2020.
204
Google Self Driving Car
Technology
The project team has equipped a number of different types of cars with the self-driving equipment, including the Toyota Prius, Audi TT, and Lexus RX450h. Google has also developed its own custom vehicle, which is assembled by Roush Enterprises and uses equipment from Bosch and LG.
Google's robotic cars have about $150,000 worth of equipment, including a $70,000 LIDAR system.
Google Self Driving Car
Laser allows the vehicle to generate a detailed 3D map of its
environment.
The car then takes these generated maps and combines them
with high-resolution maps of the world.
As of June 2014, the system works with a very high definition
inch-precision map of the area the vehicle is expected to use.
206
Google Self Driving Car
What is it?
It is the first truly driverless electric car prototype built by
Google to test the next stage of its five-year-old self-driving
car project.
It looks like a cross between a Smart car and a Nissan
Micra, with two seats and room enough for a small amount
of luggage.
It is the first real physical incarnation of Google’s vision of
what a self-driving car of the near future could be.
207
Google Self Driving Car
208
Google Self Driving Car
The software can recognise objects, people, cars, road
marking, signs and traffic lights, obeying the rules of
the road and allowing for multiple unpredictable
hazards, including cyclists. It can even detect road
works and safely navigate around them
209
Google Self Driving Car
Equipment Used
Lidar System
Video Cameras
Radar Sensors
Ultrasonic Sensors
Central Computer
210
Google Self Driving Car
Lidar
Video Cameras
Different types
of cameras are
installed at various
locations.
211
Google Self Driving Car
Radar Sensors
Ultrasonic Sensors
212
Google Self Driving Car
Central Computer
213
Google Self Driving Car
214
Google Self Driving Car
Advantages
Managing traffic flow.
Relieving Vehicles.
Avoid accidents.
Increase roadway capacity.
Determine current location.
215
Google Self Driving Car
Limitations
Vehicles can switch off on the road (in rare cases).
Less security when using the Internet.
Hackers could change routes (in rare cases).
If the sensors fail, the vehicle can cause accidents.
In rainfall, the car cannot recognise traffic signals.
216
Google Self Driving Car
References
https://en.m.wikipedia.org/wiki/Google_self-driving_car
Google self-driving car project
Self Driving Robot
Wouldn't it be cool to build your very own self-driving car using some of the same techniques the big guys use? In this and the next few articles, I will guide you through how to build your own physical, deep-learning, self-driving robotic car from scratch. You will be able to make your car detect and follow lanes, and recognize and respond to traffic signs and people on the road, in under a week.
218
Self Driving Robot
A deep convolutional neural network detects road features and makes the correct steering decisions.
https://towardsdatascience.com/deeppicar-part-1-102e03c83f2c
220
CNN-Based Vision Model for Obstacle Avoidance of Mobile Robot
https://www.researchgate.net/publication/321535021_CNN-Based_Vision_Model_for_Obstacle_Avoidance_of_Mobile_Robot
221
CNN-Based Vision Model for Obstacle Avoidance of Mobile Robot
Vision model of obstacle avoidance
A mobile robot obstacle-avoidance problem in an indoor environment is addressed with a Convolutional Neural Network (CNN) that takes the raw camera image as its only input. The method converts the raw pixels directly into steering commands: turn left, turn right, or go straight. Training data was collected by a human remotely controlling the mobile robot to explore a structured environment without colliding with obstacles. The neural network was trained in the Caffe framework, and the specific instructions are executed by the Robot Operating System (ROS).
222
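A hedged tf.keras sketch of this kind of raw-pixels-to-steering classifier (the paper trains with Caffe; the input resolution and layer sizes here are assumptions):

import tensorflow as tf

steering_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(120, 160, 3)),          # assumed camera frame size
    tf.keras.layers.Conv2D(16, 5, strides=2, activation='relu'),
    tf.keras.layers.Conv2D(32, 5, strides=2, activation='relu'),
    tf.keras.layers.Conv2D(64, 3, strides=2, activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax'),       # turn left / turn right / go straight
])
steering_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')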
CNN-Based Vision Model for Obstacle Avoidance of Mobile Robot
Vision model of obstacle avoidance
A package named ROS_caffe provides a bridge between Caffe and ROS. The raw image is acquired from the robot through ROS, a trained CNN model classifies the image to decide which direction the robot will go next, and the specific instructions are executed by ROS.
223
Test environment
CNN-Based Vision Model for Obstacle Avoidance of Mobile Robot
224
CNN-Based Vision Model for Obstacle Avoidance of Mobile Robot
Experimental results
Left
Right
Straight (Go)
225
CNN-Based Vision Model for Obstacle Avoidance of Mobile Robot
226
ML Vision Model for Obstacle Avoidance of a Car-Like Robot
227
Localization Helps Self-Driving Cars Find Their Way - NVIDIA DRIVE Labs
228
Torch: http://bit.ly/1KzuFhd
• Created/Used by NYU, Facebook, Google DeepMind. De rigueur for deep learning research
• Its language is Lua, NOT Python. Lua’s syntax is somewhat Pythonic.
• Torch’s main strengths are its features, which is why I mention it though here we
are at PyData.
Caffe: http://bit.ly/1Db2bHT
• Created/Used by Berkeley, Google
• Best tool to get started: many pre-trained reference models, and standard deep
learning datasets
• Easy to configure networks with config files
Theano: http://bit.ly/1KBsMAv
• Created/Used by University of Montreal
• Very flexible, very sophisticated: Lower level interface allows for lots of
customization, with many libraries being built ON TOP of Theano, e.g.: Keras,
PyLearn2, Lasagne, etc.
• Pythonic API, and very well documented.
229
Today: Convolutional Neural Networks (CNNs)
1. Scene understanding and object recognition for machines (and humans)
– Scene/object recognition challenge. Illusions reveal primitives, conflicting info
– Human neurons/circuits. Visual cortex layers==abstraction. General cognition
2. Classical machine vision foundations: features, scenes, filters, convolution
– Spatial structure primitives: edge detectors & other filters, feature recognition
– Convolution: basics, padding, stride, object recognition, architectures
3. CNN foundations: LeNet, de novo feature learning, parameter sharing
– Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning
– CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected
4. Modern CNN architectures: millions of parameters, dozens of layers
– Feature invariance is hard: apply perturbations, learn for each variation
– ImageNet progression of best performers
– AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU
– VGGNet: simpler but deeper (16–19 layers), 140M parameters, ensembles
– GoogleNet: new primitive=inception module, 5M params, no FC, efficiency
– ResNet: 152 layers, vanishing gradients fit residuals to enable learning
5. Countless applications: General architecture, enormous power
– Semantic segmentation, facial detection/recognition, self-driving, image
colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
230
References and resources
Convolutional Neural Networks, Prof. Manolis Kellis, MIT, 2020
http://mit6874.github.io
Resources
https://en.m.wikipedia.org/wiki/Google_self-driving_car
Google self-driving car project
UFLDL Tutorial
http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial
Hinton demo
http://www.cs.toronto.edu/~hinton/adi/index.htm