
Republic of Tunisia
Ministry of Higher Education and Scientific Research
University of Sfax
Département Génie I.I.

Classes: 3 GII-CRI & 2MR-ISI

CHAPTER 8
DEEP LEARNING AND MOBILE ROBOTS

MOHAMED SLIM MASMOUDI
Associate Professor (Maître de conférences) at ENET'COM
Academic year: 2024/2025 1
Networks for deep learning

2
Today: Convolutional Neural Networks (CNNs)
1. Scene understanding and object recognition for machines (and humans)
– Scene/object recognition challenge. Illusions reveal primitives, conflicting info
– Human neurons/circuits. Visual cortex layers==abstraction. General cognition
2. Classical machine vision foundations: features, scenes, filters, convolution
– Spatial structure primitives: edge detectors & other filters, feature recognition
– Convolution: basics, padding, stride, object recognition, architectures
3. CNN foundations: LeNet, de novo feature learning, parameter sharing
– Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning
– CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected
4. Modern CNN architectures: millions of parameters, dozens of layers
– Feature invariance is hard: apply perturbations, learn for each variation
– ImageNet progression of best performers
– AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU
– VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles
– GoogleNet: new primitive=inception module, 5M params, no FC, efficiency
– ResNet: 152 layers, vanishing gradients → fit residuals to enable learning
5. Countless applications: General architecture, enormous power
– Semantic segmentation, facial detection/recognition, self-driving, image
colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
3
1a. What do you see, and how?
Can we teach machines to see?

4
What do you see?

5
How do you see?

How can we help


computers see? 6
How do you see?

How can we help


computers see? 7
What computers ‘see’: Images as Numbers
What you see | What you both see | What the computer “sees”

Levin, Image Processing & Computer Vision

Input Image | Input Image + values | Pixel intensity values
(“pix-el” = picture-element)
An image is just a matrix of numbers in [0, 255], e.g., 1080x1080x3 for an RGB image.
Question: is this Lincoln? Washington? Jefferson? Obama?
How can the computer answer this question?
Can I just do classification on the 1,166,400-long image vector directly?
8
No. Instead: exploit image spatial structure. Learn patches. Build them up.
1b. Classical machine vision roots
in study of human/animal brains

9
Inspiration: human/animal visual cortex

• Layers of neurons: pixels, edges, shapes, primitives, scenes
• E.g., Layer 4 responds to bands with a given slant, contrasting edges
10
Primitives: Neurons & action potentials

• Chemical accumulation across dendritic connections: pre-synaptic axon → post-synaptic dendrite → neuronal cell body
• Each neuron receives multiple signals from its many dendrites; when a threshold is crossed, it fires and its axon sends an outgoing signal to downstream neurons
• Weak stimuli are ignored; sufficiently strong stimuli cross the activation threshold, giving a non-linearity within each neuronal level
• Neurons are connected into circuits (neural networks): emergent properties, learning, memory
• Simple primitives arranged in simple, repetitive, and extremely large networks
• 86 billion neurons, each connecting to ~10k neurons: about 1 quadrillion (10^15) connections
11
Abstraction layers: edges, bars, dir., shapes, objects, scenes

• LGN (Lateral Geniculate Nucleus): small dots
• V1: orientation, disparity, some color
• V4: color, basic shapes, 2D/3D, curvature
• VTC (ventral temporal cortex): complex features and objects

• Primitives of visual concepts are encoded in neuronal connections in early cortical layers
• Abstraction layers correspond to visual cortex layers
• Complex concepts are built from simple parts, in a hierarchy
12
General “learning machine”, reused widely

• Hardware expansion: chimp → human
• The massive recent expansion of the human brain has re-used a relatively simple but general learning architecture
• Hearing, taste, smell, sight, touch all re-use a similar learning architecture (visual cortex, motor cortex, auditory cortex)
• Interchangeable circuitry: the auditory cortex learns to ‘see’ if sent visual signals; tasks of an injured area shift to uninjured areas
• Not fully-general learning, but well-adapted to our world
• Humans co-opted this circuitry for many new applications; modern tasks are accessible to any Homo sapiens (<70k years)
• ML primitives are not too different from animals: more to come?
13
Today: Convolutional Neural Networks (CNNs)
1. Scene understanding and object recognition for machines (and humans)
– Scene/object recognition challenge. Illusions reveal primitives, conflicting info
– Human neurons/circuits. Visual cortex layers==abstraction. General cognition
2. Classical machine vision foundations: features, scenes, filters, convolution
– Spatial structure primitives: edge detectors & other filters, feature recognition
– Convolution: basics, padding, stride, object recognition, architectures
3. CNN foundations: LeNet, de novo feature learning, parameter sharing
– Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning
– CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected
4. Modern CNN architectures: millions of parameters, dozens of layers
– Feature invariance is hard: apply perturbations, learn for each variation
– ImageNet progression of best performers
– AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU
– VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles
– GoogleNet: new primitive=inception module, 5M params, no FC, efficiency
– ResNet: 152 layers, vanishing gradients → fit residuals to enable learning
5. Countless applications: General architecture, enormous power
– Semantic segmentation, facial detection/recognition, self-driving, image
colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
14
2a. Spatial structure
for image recognition

15
Using Spatial Structure

Input: 2D image, an array of pixel values.
Idea: connect patches of the input to neurons in the hidden layer.
Each neuron is connected to a region of the input and only “sees” these values.

16
Using Spatial Structure

Connect patch in input layer to a single neuron in subsequent layer.


Use a sliding window to define connections.
How can we weight the patch to detect particular features?

17
Feature Extraction with Convolution
- Filter of size 4x4: 16 different weights
- Apply this same filter to 4x4 patches in the input
- Shift by 2 pixels for the next patch

This “patchy” operation is convolution.

1) Apply a set of weights – a filter – to extract local features

2) Use multiple filters to extract different features

3) Spatially share parameters of each filter

18
Fully Connected Neural Network

Input:
• 2D image
• Vector of pixel values

Fully Connected:
• Each neuron in the hidden layer is connected to all neurons in the input layer
• No spatial information
• Many, many parameters

Key idea: Use the spatial structure in the input to inform the architecture of the network
19
High Level Feature Detection

Let’s identify key features in each image category

• Faces: nose, eyes, mouth
• Cars: wheels, license plate, headlights
• Houses: door, windows, steps

20
Fully Connected Neural Network

21
2b. Convolutions and filters

22
The convolution operation is an element-wise multiply and add

Filter / Kernel

23
CNNs for Classification

1. Convolution: apply filters to generate feature maps.
2. Non-linearity: often ReLU.
3. Pooling: downsampling operation on each feature map.
Train the model with image data.
Learn the weights of the filters in the convolutional layers.
tf.keras.layers.Conv2D 24
The same pattern appears in different places, so the detectors can be compressed!
What about training a lot of such “small” detectors, where each detector must “move around”?

“upper-left
beak” detector

They can be compressed


to the same parameters.

“middle beak”
detector
25
A convolutional layer
A CNN is a neural network with some convolutional layers
(and some other layers). A convolutional layer has a number
of filters that perform the convolution operation.

Beak detector

A filter

26
Convolution
These are the network parameters to be learned.

Filter 1 (3 x 3):
 1 -1 -1
-1  1 -1
-1 -1  1

Filter 2 (3 x 3):
-1  1 -1
-1  1 -1
-1  1 -1

6 x 6 input image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Each filter detects a small pattern (3 x 3). 27
Convolution with Filter 1, stride = 1: the dot product of the filter with the top-left 3 x 3 patch of the 6 x 6 image gives 3; shifting the window one pixel to the right gives -1.
28
Convolution with Filter 1, stride = 2: the window jumps two pixels at a time, and the first two outputs along the top row are 3 and -3.
29
Convolution with Filter 1, stride = 1, applied over the whole 6 x 6 image gives a 4 x 4 feature map:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
30
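The arithmetic above can be checked with a short sketch. This is a minimal NumPy reconstruction (not from the slides) of the element-wise multiply-and-add; note that CNN “convolution” is implemented as cross-correlation, i.e., the kernel is not flipped:

import numpy as np

image = np.array([[1,0,0,0,0,1],
                  [0,1,0,0,1,0],
                  [0,0,1,1,0,0],
                  [1,0,0,0,1,0],
                  [0,1,0,0,1,0],
                  [0,0,1,0,1,0]])

filter1 = np.array([[ 1,-1,-1],
                    [-1, 1,-1],
                    [-1,-1, 1]])

def convolve(img, kernel, stride=1):
    """Element-wise multiply-and-add of the kernel over each patch (no padding)."""
    f = kernel.shape[0]
    out_size = (img.shape[0] - f) // stride + 1
    out = np.zeros((out_size, out_size), dtype=int)
    for i in range(out_size):
        for j in range(out_size):
            patch = img[i*stride:i*stride+f, j*stride:j*stride+f]
            out[i, j] = np.sum(patch * kernel)
    return out

print(convolve(image, filter1, stride=1))  # the 4x4 feature map above (top-left value 3)
print(convolve(image, filter1, stride=2))  # stride 2 gives a 2x2 map instead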
Repeat this for each filter. Convolution with Filter 2, stride = 1, gives another 4 x 4 feature map:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3
Together the two feature maps form a 2 x 4 x 4 matrix (two 4 x 4 images).
31
Color image: RGB, 3 channels
For a color image each filter becomes a stack of three 3 x 3 kernels (one per channel), and the 6 x 6 input becomes a 6 x 6 x 3 tensor; each filter's output sums over all three channels.
32
Convolution vs. Fully Connected

Convolution: the 3 x 3 filters slide over the 6 x 6 image to produce feature maps.
Fully connected: the 6 x 6 image is flattened into a 36-long vector x1 … x36, and every hidden neuron connects to all 36 inputs.
33
With convolution, the neuron computing the first output of Filter 1 (value 3) connects to only 9 of the 36 flattened inputs (pixels 1, 2, 3, 7, 8, 9, 13, 14, 15), not to all of them: fewer parameters!
34
The neuron computing the next output (value -1) connects to a different set of 9 inputs (the window shifted by one pixel) and re-uses the same 9 weights: shared weights, so even fewer parameters.
35
The whole CNN

Input image → Convolution → Max Pooling → Convolution → Max Pooling (can repeat many times) → Flattened → Fully Connected Feedforward network → output (cat, dog, …)
36
Max Pooling

The two 4 x 4 feature maps produced by Filter 1 and Filter 2:

Filter 1:            Filter 2:
 3 -1 -3 -1          -1 -1 -1 -1
-3  1  0 -3          -1 -1 -2  1
-3 -3  0  1          -1 -1 -2  1
 3 -2 -2 -1          -1  0 -4  3

Max pooling keeps the maximum value within each 2 x 2 block of a feature map.
37
Why Pooling

Subsampling pixels will not change the object: a subsampled bird is still a bird.
We can subsample the pixels to make the image smaller → fewer parameters to characterize the image.

A CNN compresses a fully connected network in two ways:
• Reducing the number of connections
• Sharing weights on the edges
Max pooling further reduces the complexity.
Max Pooling

Applying convolution (stride 1) and then 2 x 2 max pooling to the 6 x 6 image gives a new, smaller 2 x 2 image per filter:
Filter 1:  3 0      Filter 2:  -1 1
           3 1                  0 3
Each filter produces one channel of the new image.
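As a small illustrative sketch (NumPy, assuming non-overlapping 2 x 2 windows with stride 2), the max-pooling step above could be written as:

import numpy as np

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a single feature map."""
    H, W = fmap.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            out[i // 2, j // 2] = fmap[i:i+2, j:j+2].max()
    return out

# The 4x4 feature map produced by Filter 1 above
fmap1 = np.array([[ 3, -1, -3, -1],
                  [-3,  1,  0, -3],
                  [-3, -3,  0,  1],
                  [ 3, -2, -2, -1]])
print(max_pool_2x2(fmap1))   # [[3. 0.] [3. 1.]] as on the slide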
The whole CNN

The image passes through Convolution and then Max Pooling, producing a new image that is smaller than the original; the number of channels of the new image equals the number of filters. Convolution + Max Pooling can be repeated many times.
The whole CNN

Convolution → Max Pooling → a new image → Convolution → Max Pooling → a new image → Flattened → Fully Connected Feedforward network → cat, dog, …

Flattening

The 2 x 2 feature maps (3 0 / 3 1 for Filter 1 and -1 1 / 0 3 for Filter 2) are flattened into a single vector, which is fed into a fully connected feedforward network.
CNN in Keras
Only the network structure and the input format change (vector → 3-D tensor).

input → Convolution → Max Pooling → Convolution → Max Pooling → …
There are 25 filters, each 3x3.
Input_shape = (28, 28, 1): 28 x 28 pixels; 1: black/white, 3: RGB.
CNN in Keras
Only the network structure and the input format change (vector → 3-D array).

Input: 1 x 28 x 28
Convolution → 25 x 26 x 26 (how many parameters for each filter? 9)
Max Pooling → 25 x 13 x 13
Convolution → 50 x 11 x 11 (how many parameters for each filter? 225 = 25 x 9)
Max Pooling → 50 x 5 x 5

Reminder:
Output size = (N+2P-F)/stride + 1
CNN in Keras
Only the network structure and the input format change (vector → 3-D array).

Input: 1 x 28 x 28
Convolution, 25 filters, 3 x 3 → 25 x 26 x 26
Max Pooling, 2 x 2 → 25 x 13 x 13
Convolution, 50 filters, 3 x 3 → 50 x 11 x 11
Max Pooling, 2 x 2 → 50 x 5 x 5
Flattened → 1250 → fully connected feedforward network → Output

Reminder:
Output size = (N+2P-F)/stride + 1
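As a minimal sketch of this architecture in tf.keras (the 10-class softmax head is an assumption; the slide stops at the 1250-long flattened vector):

import tensorflow as tf

# Sketch of the architecture above: 28x28x1 input, 25 then 50 filters of 3x3,
# 2x2 max pooling after each convolution, flatten to 1250, then a classifier head.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(25, kernel_size=(3, 3), activation='relu'),   # -> 26x26x25
    tf.keras.layers.MaxPool2D(pool_size=(2, 2)),                         # -> 13x13x25
    tf.keras.layers.Conv2D(50, kernel_size=(3, 3), activation='relu'),   # -> 11x11x50
    tf.keras.layers.MaxPool2D(pool_size=(2, 2)),                         # -> 5x5x50
    tf.keras.layers.Flatten(),                                           # -> 1250
    tf.keras.layers.Dense(10, activation='softmax'),                     # hypothetical 10-class head
])
model.summary()  # prints the 26x26 / 13x13 / 11x11 / 5x5 shapes from the slide (channels last)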
AlphaGo*

A neural network maps the 19 x 19 board position (Black: 1, White: -1, none: 0), given as a 19 x 19 matrix, to the next move (19 x 19 positions).
A fully-connected feedforward network can be used, but a CNN performs much better.

*AlphaGo is a computer program capable of playing the game of Go, developed by the British company DeepMind, which was acquired by Google in 2014.
AlphaGo’s policy network

The following is a quotation from their Nature article:


Note: AlphaGo does not use Max Pooling.
CNN in speech recognition

The input “image” is a spectrogram (frequency vs. time), and the CNN filters move in the frequency direction.
CNN in text classification

Source of image: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.703.6858&rep=rep1&type=pdf
CNN: An Example (Animation): padding

Animated cases: no padding, stride 1 | padding 1, stride 1 | padding 1, stride 2 (odd) | no padding, stride 2 | padding 1, stride 2

Reminder:
Output size = (N+2P-F)/stride + 1

Padding adds additional rows and columns of pixels around the edges of the input data so that the size of the output feature map is the same as the size of the input data. Stride is how far the filter moves in every step along one direction.
*source: https://github.com/vdumoulin/conv_arithmetic
CNN: An Example
32×32×3
• Input volume : 32×32×3
• 10 5×5 filters with stride 1, pad 2 32

32
3
Output volume size = ?

Reminder:
Output size = (N+2P-F)/stride + 1
*reference: http://cs231n.stanford.edu/2017/
CNN: An Example
32×32×3
• Input volume : 32×32×3
• 10 5×5 filters with stride 1, pad 2 32 32

32 32
Output volume size = ?
3 10
• (32 + 2×2 - 5)/1 + 1 = 32 spatially
• => 32×32×10

Reminder:
Output size = (N+2P-F)/stride + 1
*reference: http://cs231n.stanford.edu/2017/
CNN: An Example
32×32×3
• Input volume : 32×32×3
• 10 5×5 filters with stride 1, pad 2 32 32

32 32
3 10
Number of parameters in this layer?

Reminder:
Output size = (N+2P-F)/stride + 1
*reference: http://cs231n.stanford.edu/2017/
CNN: An Example
32×32×3
• Input volume : 32×32×3
• 10 5×5 filters with stride 1, pad 2 32 32

32 32
Number of parameters in this layer?
3 10
• Each filter has 5×5×3 + 1 = 76 params (+1 for the bias)
• => 76×10 = 760

Number of parameters = ((width of filter × height of filter × number of channels in the previous layer) + 1) × number of filters

Reminder:
Output size = (N+2P-F)/stride + 1
*reference: http://cs231n.stanford.edu/2017/
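A tiny helper (plain Python, illustrative names) that applies these two formulas to the example above:

def conv_output_size(n, f, padding, stride):
    """Spatial output size of a convolution: (N + 2P - F) / stride + 1."""
    return (n + 2 * padding - f) // stride + 1

def conv_param_count(f, in_channels, num_filters):
    """(filter_width * filter_height * input_channels + 1 bias) * number of filters."""
    return (f * f * in_channels + 1) * num_filters

# 32x32x3 input, ten 5x5 filters, stride 1, pad 2 (the example above)
print(conv_output_size(32, 5, padding=2, stride=1))          # -> 32, so the output volume is 32x32x10
print(conv_param_count(5, in_channels=3, num_filters=10))    # -> 760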
CNN: Convolution
• ConvNet is a sequence of Convolutional layers, followed by non-linearity

32×32×3 image
→ Conv + ReLU (e.g., 4 filters of 5×5×3) → 28×28×4
→ Conv + ReLU (e.g., 6 filters of 5×5×4) → 24×24×6
→ Conv + ReLU (e.g., 10 filters of 5×5×6) → 20×20×10

• Choices of other non-linearity: Tanh/Sigmoid, ReLU [Nair et al., 2010], Leaky ReLU [Maas et al., 2013]

Reminder:
Output size = (N+2P-F)/stride + 1
*reference: http://cs231n.stanford.edu/2017/
*Image source: https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6 53
CNN: Pooling
• Pooling layer
• Makes the representations smaller and more manageable
• Operates over each activation map independently
• Enhance translation invariance (invariance to small transformation)
• Larger receptive fields (see more of input)
• Regularization effect

Example: a 224×224×64 input is downsampled by pooling to 112×112×64.

*reference: http://cs231n.stanford.edu/2017/ 54
CNN: Pooling
• Max pooling and average pooling
• With 2×2 filters and stride 2

• Other kinds of pooling layers are also used, e.g., stochastic pooling, ROI pooling

*sources:
https://deepsense.ai/region-of-interest-pooling-explained/
http://mlss.tuebingen.mpg.de/2015/slides/fergus/Fergus_1.pdf
https://vaaaaaanquish.hatenablog.com/entry/2015/01/26/060622 55
CNN: Visualization
• Visualization of CNN feature representations [Zeiler et al., 2014]
• VGG-16 [Simonyan et al., 2015]

*reference: http://cs231n.stanford.edu/2017/ 56
CNN in Computer Vision: Everywhere

Classification and retrieval [Krizhevsky et al., 2012]


CNN in Computer Vision:
Everywhere
Detection [Ren et al., 2015] Segmentation [Farabet et al., 2013]

*reference: http://cs231n.stanford.edu/2017/
CNN in Computer Vision: Everywhere
Self-driving cars | Human pose estimation [Cao et al., 2017]

Image captioning [Vinyals et al., 2015][Karpathy et al., 2015]

*reference: http://cs231n.stanford.edu/2017/
Automated Vehicles/Robots and Transportation
Producing Feature Maps

Original | Sharpen | Edge Detect | “Strong” Edge Detect

64
A simple pattern: Edges
How can we detect edges with a kernel?

Input image → filter (small kernel) → output feature map

65
(Goodfellow 2016)
Simple Kernels* / Filters

66
* the central or most important part of something, a central or basic part
X or X?

An image is represented as a matrix of pixel values… and computers are literal!

We want to be able to classify an X as an X even if it's shifted, shrunk, rotated, or deformed.

67
Rohrer How do CNNs work?
There are three approaches to edge cases in
convolution

68
Zero Padding Controls Output Size
(Goodfellow 2016)

• Same convolution: zero pad the input so the output has the same size as the input dimensions
• Valid-only convolution: produce output only where the entire kernel is contained in the input (shrinks the output)
• Full convolution: zero pad the input so an output is produced whenever an output value contains at least one input value (expands the output)

x = tf.nn.conv2d(x, W, strides=[1,strides,strides,1],padding='SAME')

• TF convolution operator takes stride and zero fill option as parameters


• Stride is distance between kernel applications in each dimension 69
• Padding can be SAME or VALID
Today: Convolutional Neural Networks (CNNs)
1. Scene understanding and object recognition for machines (and humans)
– Scene/object recognition challenge. Illusions reveal primitives, conflicting info
– Human neurons/circuits. Visual cortex layers==abstraction. General cognition
2. Classical machine vision foundations: features, scenes, filters, convolution
– Spatial structure primitives: edge detectors & other filters, feature recognition
– Convolution: basics, padding, stride, object recognition, architectures
3. CNN foundations: LeNet, de novo feature learning, parameter sharing
– Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning
– CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected
4. Modern CNN architectures: millions of parameters, dozens of layers
– Feature invariance is hard: apply perturbations, learn for each variation
– ImageNet progression of best performers
– AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU
– VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles
– GoogleNet: new primitive=inception module, 5M params, no FC, efficiency
– ResNet: 152 layers, vanishing gradients → fit residuals to enable learning
5. Countless applications: General architecture, enormous power
– Semantic segmentation, facial detection/recognition, self-driving, image
colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
70
3a. Learning Visual Features
de novo

71
Key idea:
learn hierarchy of features
directly from the data
(rather than hand-engineering them)

Low level features: edges, dark spots | Mid level features: eyes, ears, nose | High level features: facial structure

72
Lee+ ICML 2009
Key idea: re-use parameters
Convolution shares parameters
Example 3x3 convolution on a 5x5 image

73
Feature Extraction with Convolution

1) Apply a set of weights – a filter – to extract local features


2) Use multiple filters to extract different features
3) Spatially share parameters of each filter
74
LeNet-5
• Gradient Based Learning Applied To Document Recognition -
Y. Lecun, L. Bottou, Y. Bengio, P. Haffner; 1998
• Helped establish how we use CNNs today
• Replaced manual feature extraction

75
[LeCun et al., 1998]
LeNet-5
32×32×1 input → conv 5×5, s=1 → 28×28×6 → avg pool f=2, s=2 → 14×14×6 → conv 5×5, s=1 → 10×10×16 → avg pool f=2, s=2 → 5×5×16 → FC 120 → FC 84 → output 10

Reminder:
Output size = (N+2P-F)/stride + 1
76
This slide is taken from Andrew Ng [LeCun et al., 1998]
LeNet-5

• Only 60K parameters
• As we go deeper in the network, the height and width decrease while the number of channels increases
• General structure: conv → pool → conv → pool → FC → FC → output
• Different filters look at different channels
• Sigmoid and Tanh nonlinearity

77
[LeCun et al., 1998]
Backpropagation of convolution

78
Slide taken from Forward And Backpropagation in Convolutional Neural Network. - Medium
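Since the slide's figure is not reproduced here, a rough NumPy sketch (valid cross-correlation, stride 1, single channel) of how the filter gradient is obtained during backpropagation may help; the gradient of the loss with respect to the filter is itself a correlation between the input and the upstream gradient:

import numpy as np

def conv2d_valid(x, w):
    """Cross-correlation (the 'convolution' used in CNNs), valid padding, stride 1."""
    H, W = x.shape
    f = w.shape[0]
    out = np.zeros((H - f + 1, W - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+f, j:j+f] * w)
    return out

def conv2d_filter_grad(x, dout, f):
    """dL/dW for the layer above: correlate the input with the upstream gradient dL/dOut."""
    dw = np.zeros((f, f))
    for i in range(dout.shape[0]):
        for j in range(dout.shape[1]):
            dw += dout[i, j] * x[i:i+f, j:j+f]
    return dw

# Tiny numerical check against a finite difference (illustrative only)
x = np.random.randn(6, 6)
w = np.random.randn(3, 3)
dout = np.ones((4, 4))            # pretend dL/dOut is all ones
dw = conv2d_filter_grad(x, dout, 3)
eps = 1e-6
w2 = w.copy(); w2[0, 0] += eps
numeric = (conv2d_valid(x, w2).sum() - conv2d_valid(x, w).sum()) / eps
print(np.isclose(dw[0, 0], numeric))  # True: analytic and numeric gradients agree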
3b. Convolutional Neural
Networks (CNNs)

79
An image classification CNN

80
Representation Learning in Deep CNNs

Low level features (Conv Layer 1): edges, dark spots | Mid level features (Conv Layer 2): eyes, ears, nose | High level features (Conv Layer 3): facial structure

81
Lee+ ICML 2009
CNNs for Classification

1. Convolution: apply filters to generate feature maps. (tf.keras.layers.Conv2D)
2. Non-linearity: often ReLU. (tf.keras.activations.*)
3. Pooling: downsampling operation on each feature map. (tf.keras.layers.MaxPool2D)

Train the model with image data.
Learn the weights of the filters in the convolutional layers.
82
Example – Six convolutional layers

83
Convolutional Layers: Local Connectivity
tf.keras.layers.Conv2D

For a neuron in
hidden layer:
- Take inputs from patch
- Compute weighted
sum
- Apply bias

84
Convolutional Layers: Local Connectivity
tf.keras.layers.Conv2D

For a neuron in hidden layer:


• Take inputs from patch
• Compute weighted sum
• Apply bias

4x4 filter (a matrix of weights wij) for neuron (p, q) in the hidden layer:
1) apply a window of weights
2) compute linear combinations
3) activate with a non-linear function
85
CNNs: Spatial Arrangement of Output
Volume
Layer Dimensions: h × w × d, where h and w are the spatial dimensions and d (depth) is the number of filters.

Stride: filter step size.

Receptive Field: the locations in the input image that a node is path-connected to.
tf.keras.layers.Conv2D( filters=d, kernel_size=(h,w), strides=s )

86
Introducing Non-Linearity
- Apply after every convolution operation (i.e., after convolutional layers)
- ReLU (Rectified Linear Unit): pixel-by-pixel operation that replaces all negative values by zero
- Non-linear operation

tf.keras.layers.ReLU

87
Karn, Intuitive CNNs
Pooling

tf.keras.layers.MaxPool2D(pool_size=(2,2), strides=2)

1) Reduced dimensionality
2) Spatial invariance

Max pooling, average pooling

88
The REctified Linear Unit (RELU) is a common
non-linear detector stage after convolution

x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')


x = tf.nn.bias_add(x, b)
x= tf.nn.relu(x)

f(x) = max(0, x)
When will we backpropagate through this?
Once it “dies” what happens to it? 89
Pooling reduces dimensionality by giving up
spatial location
•max pooling reports the maximum output
within a defined neighborhood
• Padding can be SAME or VALID

x = tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME')

Input tensor layout: [batch, height, width, channels]; ksize sets the pooling neighborhood and strides the step in each dimension.

90
Dilated Convolution

91
CNNs for Classification: Feature Learning

1. Learn features in input image through convolution


2. Introduce non-linearity through activation function (real-world data is
non-linear!)
3. Reduce dimensionality and preserve spatial invariance with pooling
92
CNNs for Classification: Class Probabilities

- CONV and POOL layers output high-level features of input


- Fully connected layer uses these features for classifying input image
- Express output as probability of image belonging to a particular class
93
Putting it all together
import tensorflow as tf

def generate_model():
    model = tf.keras.Sequential([
        # first convolutional layer
        tf.keras.layers.Conv2D(32, kernel_size=3, activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),

        # second convolutional layer
        tf.keras.layers.Conv2D(64, kernel_size=3, activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),

        # fully connected classifier
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1024, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')  # 10 outputs
    ])
    return model

94
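A possible way to exercise generate_model() (the dataset, optimizer, and epoch count below are illustrative assumptions, not part of the slide):

# Illustrative usage of generate_model() on MNIST-sized data (assumed, not from the slide)
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0                    # (N, 28, 28, 1), scaled to [0, 1]

model = generate_model()
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=64, epochs=3)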
Today: Convolutional Neural Networks (CNNs)
1. Scene understanding and object recognition for machines (and humans)
– Scene/object recognition challenge. Illusions reveal primitives, conflicting info
– Human neurons/circuits. Visual cortex layers==abstraction. General cognition
2. Classical machine vision foundations: features, scenes, filters, convolution
– Spatial structure primitives: edge detectors & other filters, feature recognition
– Convolution: basics, padding, stride, object recognition, architectures
3. CNN foundations: LeNet, de novo feature learning, parameter sharing
– Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning
– CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected
4. Modern CNN architectures: millions of parameters, dozens of layers
– Feature invariance is hard: apply perturbations, learn for each variation
– ImageNet progression of best performers
– AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU
– VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles
– GoogleNet: new primitive=inception module, 5M params, no FC, efficiency
– ResNet: 152 layers, vanishing gradients → fit residuals to enable learning
5. Countless applications: General architecture, enormous power
– Semantic segmentation, facial detection/recognition, self-driving, image
colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
95
4a. Real-world feature invariance is
hard

96
How can computers recognize objects?

97
How can computers recognize objects?

Challenge:
•Objects can be anywhere in the scene, in any orientation, rotation, color hue, etc.
•How can we overcome this challenge?
Answer:
• Learn a ton of features (millions) from the bottom up
• Learn the convolutional filters, rather than pre-computing them 98
Feature invariance to perturbation is hard

99
Li/Johnson/Yeung CS231n
Next-generation models
explode # of parameters

100
LeNet-5
• Gradient Based Learning Applied To Document Recognition -
Y. Lecun, L. Bottou, Y. Bengio, P. Haffner; 1998
• Helped establish how we use CNNs today
• Replaced manual feature extraction

101
[LeCun et al., 1998]
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners

102
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
AlexNet

• ImageNet Classification with Deep Convolutional


Neural Networks - Alex Krizhevsky, Ilya Sutskever,
Geoffrey E. Hinton; 2012
• Facilitated by GPUs, highly optimized convolution
implementation and large datasets (ImageNet)
• One of the largest CNNs to date
• Has 60 million parameters compared to the 60k parameters of LeNet-5

103
[Krizhevsky et al., 2012]
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners

• The annual “Olympics” of computer vision.

• Teams from across the world compete to see who has the
best computer vision model for tasks such as classification,
localization, detection, and more.

• 2012 marked the first year where a CNN was used to


achieve a top 5 test error rate of 15.3%.

• The next best entry achieved an error of 26.2%.


104
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners

105
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Architecture AlexNet
Layers: CONV1 → MAX POOL1 → NORM1 → CONV2 → MAX POOL2 → NORM2 → CONV3 → CONV4 → CONV5 → MAX POOL3 → FC6 → FC7 → FC8

• Input: 227x227x3 images (224x224 before padding)
• First layer (CONV1): 96 11x11 filters applied at stride 4
• Output volume size? (N-F)/s+1 = (227-11)/4+1 = 55 -> [55x55x96]
• Number of parameters in this layer? (11*11*3)*96 = 35K

Reminder:
Output size = (N+2P-F)/stride + 1
106
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Krizhevsky et al., 2012]
AlexNet

107
[Krizhevsky et al., 2012]
Architecture AlexNet
Layers: CONV1 → MAX POOL1 → NORM1 → CONV2 → MAX POOL2 → NORM2 → CONV3 → CONV4 → CONV5 → MAX POOL3 → FC6 → FC7 → FC8

• Input: 227x227x3 images (224x224 before padding)
• After CONV1: 55x55x96
• Second layer (MAX POOL1): 3x3 filters applied at stride 2
• Output volume size? (N-F)/s+1 = (55-3)/2+1 = 27 -> [27x27x96]
• Number of parameters in this layer? 0!
108
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Krizhevsky et al., 2012]
AlexNet
227×227×3 → conv 11×11, s=4, P=0 → 55×55×96 → max pool 3×3, s=2 → 27×27×96 → conv 5×5, s=1, P=2 → 27×27×256 → max pool 3×3, s=2 → 13×13×256 → conv 3×3, s=1, P=1 → 13×13×384 → conv 3×3, s=1, P=1 → 13×13×384 → conv 3×3, s=1, P=1 → 13×13×256 → max pool 3×3, s=2 → 6×6×256

Reminder:
Output size = (N+2P-F)/stride + 1
109
This slide is taken from Andrew Ng [Krizhevsky et al., 2012]
AlexNet

FC FC
...

Softmax
1000
4096 4096

110
This slide is taken from Andrew Ng [Krizhevsky et al., 2012]
AlexNet
Details/Retrospectives:
• first use of ReLU
• used Norm layers (not common anymore)
• heavy data augmentation
• dropout 0.5
• batch size 128
• 7 CNN ensemble

111
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Krizhevsky et al., 2012]
AlexNet
• Trained on GTX 580 GPU with only 3 GB of memory.

• Network spread across 2 GPUs, half the neurons (feature


maps) on each GPU.

• CONV1, CONV2, CONV4, CONV5:


Connections only with feature maps on same GPU.
• CONV3, FC6, FC7, FC8:
Connections with all feature maps in preceding layer,
communication across GPUs.

112
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Krizhevsky et al., 2012]
AlexNet

AlexNet was the coming out party for CNNs in the computer
vision community. This was the first time a model performed
so well on a historically difficult ImageNet dataset. This
paper illustrated the benefits of CNNs and backed them up
with record breaking performance in the competition.

113
[Krizhevsky et al., 2012]
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners

114
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners

115
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
VGGNet

• Very Deep Convolutional Networks For Large Scale


Image Recognition - Karen Simonyan and Andrew
Zisserman; 2015
• The runner-up at the ILSVRC 2014 competition
• Significantly deeper than AlexNet
• 140 million parameters

116
[Simonyan and Zisserman, 2014]
VGGNet

Layer stack (VGG16): input → [3x3 conv, 64] ×2 → pool → [3x3 conv, 128] ×2 → pool → [3x3 conv, 256] ×3 → pool → [3x3 conv, 512] ×3 → pool → [3x3 conv, 512] ×3 → pool → FC 4096 → FC 4096 → FC 1000 → softmax

• Smaller filters: only 3x3 CONV filters, stride 1, pad 1, and 2x2 MAX POOL, stride 2
• Deeper network: AlexNet had 8 layers, VGGNet has 16 - 19 layers
• ZFNet: 11.7% top-5 error in ILSVRC'13
• VGGNet: 7.3% top-5 error in ILSVRC'14
117
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014]
VGGNet
• Why use smaller filters? (3x3 conv)
A stack of three 3x3 conv (stride 1) layers has the same effective receptive field as one 7x7 conv layer.

• What is the effective receptive field of three 3x3 conv (stride 1) layers?
7x7
But deeper, with more non-linearities
And fewer parameters: 3 × (3²C²) vs. 7²C² for C channels per layer

118
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014]
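A quick numeric check of that parameter comparison (plain Python; the channel count is illustrative and biases are ignored, as on the slide):

# Parameters of three stacked 3x3 conv layers vs. one 7x7 conv layer,
# each with C input and C output channels (biases ignored).
C = 64                               # illustrative channel count
stacked_3x3 = 3 * (3 * 3 * C * C)    # 3 x 9C^2
single_7x7 = 7 * 7 * C * C           # 49C^2
print(stacked_3x3, single_7x7)       # 110592 vs. 200704 -> the stack is cheaper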
VGGNet

VGG16:
TOTAL memory: 24M * 4 bytes ~= 96MB / image
TOTAL params: 138M parameters
(full layer list: see the per-layer breakdown below)
119
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014]
Input memory: 224*224*3=150K params: 0
3x3 conv, 64 memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
3x3 conv, 64 memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
Pool memory: 112*112*64=800K params: 0
3x3 conv, 128 memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728

3x3 conv, 128 memory: 112*112*128=1.6M params: (3*3*128)*128 =


147,456
Pool memory: 56*56*128=400K params: 0
3x3 conv, 256 memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
3x3 conv, 256 memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
3x3 conv, 256 memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
Pool memory: 28*28*256=200K params: 0
3x3 conv, 512 memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
3x3 conv, 512 memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
3x3 conv, 512 memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
Pool memory: 14*14*512=100K params: 0
3x3 conv, 512 memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
3x3 conv, 512 memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
3x3 conv, 512 memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
Pool memory: 7*7*512=25K params: 0
FC 4096 memory: 4096 params: 7*7*512*4096 = 102,760,448
FC 4096 memory: 4096 params: 4096*4096 = 16,777,216
FC 1000 memory: 1000 params: 4096*1000 = 4,096,000 120
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014]
VGGNet
Details/Retrospectives :
• ILSVRC’14 2nd in classification, 1st in localization
• Similar training procedure as AlexNet
• No Local Response Normalisation (LRN)
• Use VGG16 or VGG19 (VGG19 only slightly better, more
memory)
• Use ensembles for best results
• FC7 features generalize well to other tasks
• Trained on 4 Nvidia Titan Black GPUs for two to three weeks.

121
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014]
VGGNet

VGG Net reinforced the notion that convolutional neural


networks have to have a deep network of layers in order for
this hierarchical representation of visual data to work.
Keep it deep.
Keep it simple.

122
[Simonyan and Zisserman, 2014]
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners

123
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
GoogleNet

• Going Deeper with Convolutions - Christian Szegedy et


al.; 2015
• ILSVRC 2014 competition winner
• Also significantly deeper than AlexNet
• 12x fewer parameters than AlexNet
• Focused on computational efficiency

124
[Szegedy et al., 2014]
GoogleNet
• 22 layers
• Efficient “Inception” module - strayed from
the general approach of simply stacking conv
and pooling layers on top of each other in a
sequential structure
• No FC layers
• Only 5 million parameters!
• ILSVRC’14 classification winner (6.7% top 5
error)

125
[Szegedy et al., 2014]
GoogleNet
“Inception module”: design a good local network topology (network within
a network) and then stack these modules on top of each other

Previous layer → four parallel branches:
• 1x1 convolution
• 1x1 convolution → 3x3 convolution
• 1x1 convolution → 5x5 convolution
• 3x3 max pooling → 1x1 convolution
→ filter concatenation

126
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014]
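A minimal tf.keras sketch of such a module (the filter counts are illustrative choices, not necessarily GoogLeNet's):

import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1=64, f3_reduce=96, f3=128, f5_reduce=16, f5=32, fpool=32):
    """Sketch of an Inception module; the filter counts are illustrative."""
    b1 = layers.Conv2D(f1, 1, padding='same', activation='relu')(x)          # 1x1 branch
    b2 = layers.Conv2D(f3_reduce, 1, padding='same', activation='relu')(x)   # 1x1 bottleneck
    b2 = layers.Conv2D(f3, 3, padding='same', activation='relu')(b2)         # then 3x3
    b3 = layers.Conv2D(f5_reduce, 1, padding='same', activation='relu')(x)   # 1x1 bottleneck
    b3 = layers.Conv2D(f5, 5, padding='same', activation='relu')(b3)         # then 5x5
    b4 = layers.MaxPool2D(3, strides=1, padding='same')(x)                   # 3x3 max pooling
    b4 = layers.Conv2D(fpool, 1, padding='same', activation='relu')(b4)      # then 1x1
    return layers.Concatenate()([b1, b2, b3, b4])                            # filter concatenation

inp = layers.Input(shape=(28, 28, 192))
out = inception_module(inp)
print(tf.keras.Model(inp, out).output_shape)   # (None, 28, 28, 256) with these counts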
GoogleNet
Details/Retrospectives :
• Deeper networks, with computational efficiency
• 22 layers
• Efficient “Inception” module
• No FC layers
• 12x less params than AlexNet
• ILSVRC’14 classification winner (6.7% top 5 error)

127
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014]
GoogleNet

Introduced the idea that CNN layers didn’t always have to be


stacked up sequentially. Coming up with the Inception
module, the authors showed that a creative structuring of
layers can lead to improved performance and
computational efficiency.

128
[Szegedy et al., 2014]
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners

129
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ResNet

• Deep Residual Learning for Image Recognition -


Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun;
2015
• Extremely deep network – 152 layers
• Deeper neural networks are more difficult to train.
• Deep networks suffer from vanishing and
exploding gradients.
• Present a residual learning framework to ease the
training of networks that are substantially deeper
than those used previously.
130
[He et al., 2015]
ResNet
• ILSVRC’15 classification winner (3.57% top 5
error, humans generally hover around a 5-
10% error rate)
Swept all classification and detection
competitions in ILSVRC’15 and COCO’15!

131
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
• What happens when we continue stacking deeper layers on a
convolutional neural network?

• 56-layer model performs worse on both training and test error


-> The deeper model performs worse (not caused by overfitting)!
132
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
• Hypothesis: The problem is an optimization problem. Very
deep networks are harder to optimize.
• Solution: Use network layers to fit residual mapping instead
of directly trying to fit a desired underlying mapping.

• We will use skip connections allowing us to take the activation


from one layer and feed it into another layer, much deeper into
the network.
• Use layers to fit residual F(x) = H(x) – x
instead of H(x) directly

133
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
Residual Block
Input x goes through conv-relu-conv series and gives us F(x).
That result is then added to the original input x. Let’s call that
H(x) = F(x) + x.
In traditional CNNs, H(x) would just be equal to F(x). So, instead
of just computing that transformation (straight from x to F(x)),
we’re computing the term that we have to add, F(x), to the
input, x.

134
[He et al., 2015]
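A minimal tf.keras sketch of such a residual block (filter count, kernel size, and input shape are illustrative; real ResNets also use batch normalization, which is omitted here):

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """Sketch of a basic residual block: H(x) = F(x) + x, with F = conv-relu-conv."""
    shortcut = x                                               # identity skip connection
    f = layers.Conv2D(filters, 3, padding='same')(x)
    f = layers.ReLU()(f)
    f = layers.Conv2D(filters, 3, padding='same')(f)           # this path learns the residual F(x)
    out = layers.Add()([f, shortcut])                          # H(x) = F(x) + x
    return layers.ReLU()(out)

inp = layers.Input(shape=(56, 56, 64))
out = residual_block(inp)
tf.keras.Model(inp, out).summary()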
ResNet

[He et al., 2015] 135
ResNet
Full ResNet architecture:
• Stack residual blocks
• Every residual block has two 3x3 conv layers
• Periodically, double # of filters and
downsample spatially using stride 2 (in each
dimension)
• Additional conv layer at the beginning
• No FC layers at the end (only FC 1000 to
output classes)

136
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
• Total depths of 34, 50, 101, or 152 layers for
ImageNet
• For deeper networks (ResNet-50+), use
“bottleneck” layer to improve efficiency
(similar to GoogLeNet)

137
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
Experimental Results:
• Able to train very deep networks without degrading
• Deeper networks now achieve lower training errors as
expected

138
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet

The best CNN architecture that we currently have, and a great innovation for the idea of residual learning.
Even better than human performance!

139
[He et al., 2015]
Accuracy comparison


140
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Forward pass time and power
consumption


141
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners

142
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Today: Convolutional Neural Networks (CNNs)
1. Scene understanding and object recognition for machines (and humans)
– Scene/object recognition challenge. Illusions reveal primitives, conflicting info
– Human neurons/circuits. Visual cortex layers==abstraction. General cognition
2. Classical machine vision foundations: features, scenes, filters, convolution
– Spatial structure primitives: edge detectors & other filters, feature recognition
– Convolution: basics, padding, stride, object recognition, architectures
3. CNN foundations: LeNet, de novo feature learning, parameter sharing
– Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning
– CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected
4. Modern CNN architectures: millions of parameters, dozens of layers
– Feature invariance is hard: apply perturbations, learn for each variation
– ImageNet progression of best performers
– AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU
– VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles
– GoogleNet: new primitive=inception module, 5M params, no FC, efficiency
– ResNet: 152 layers, vanishing gradients → fit residuals to enable learning
5. Countless applications: General architecture, enormous power
– Semantic segmentation, facial detection/recognition, self-driving, image
colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
143
Countless applications

144
An Architecture for Many Applications

Detection
Semantic segmentation
End-to-end robotic control
145
Semantic Segmentation: Fully Convolutional Networks

FCN: Fully Convolutional Network.
The network is designed with all convolutional layers, with downsampling and upsampling operations.

tf.keras.layers.Conv2DTranspose

146
Long+ CVPR 2015
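A toy tf.keras sketch of the downsample/upsample idea using Conv2DTranspose (layer sizes and the class count are illustrative assumptions, not the FCN paper's actual configuration):

import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 21  # illustrative class count (e.g., PASCAL VOC), not from the slide

inp = layers.Input(shape=(128, 128, 3))
x = layers.Conv2D(32, 3, strides=2, padding='same', activation='relu')(inp)         # downsample /2
x = layers.Conv2D(64, 3, strides=2, padding='same', activation='relu')(x)           # downsample /4
x = layers.Conv2DTranspose(32, 3, strides=2, padding='same', activation='relu')(x)  # upsample x2
x = layers.Conv2DTranspose(NUM_CLASSES, 3, strides=2, padding='same')(x)            # back to input size
out = layers.Softmax(axis=-1)(x)      # per-pixel class probabilities
model = tf.keras.Model(inp, out)      # output shape: (128, 128, NUM_CLASSES)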
Facial Detection & Recognition

147
Self-Driving Cars

148
Amini+ ICRA 2019.
Self-Driving Cars: Navigation from Visual Perception

Raw Perception (e.g., camera) + Coarse Maps (e.g., GPS) → Possible Control Commands
149
Amini+ ICRA 2019
End-to-End Framework for Autonomous Navigation

Amini+ ICRA 2019


Entire model trained end-to-end
without any human labelling or annotations
150
Automatic Colorization of Black and White Images

151
Optimizing Images

Post Processing vs. Feature Optimization, shown for: color curves and details, illumination, and color tone (warmness)
152
Up-scaling low-resolution images

153
Medicine, Biology, Healthcare

154
Gulshan+ JAMA 2016.
Breast Cancer Screening

A CNN-based system outperformed expert radiologist readers at detecting breast cancer from mammograms; shown: a breast cancer case missed by the radiologist but detected by AI.
155
Semantic Segmentation: Biomedical Image Analysis

Brain tumors (Dong+ MIUA 2017): original, ground truth, segmentation, uncertainty
Malaria infection (Soleimany+ arXiv 2019)
156
Dong+ MIUA 2017;Soleimany+ arXiv 2019
DeepBind

157
[Alipanahi et al., 2015]
Predicting disease mutations

158
[Alipanahi et al., 2015]
CNN AND SELF DRIVING
ROBOTIC CARS

159
Driverless Cars: Implications for
Travel Behavior - #AutoBhatSX

Dr. Chandra Bhat (with Prof. Pendyala of ASU)


Center for Transportation Research
University of Texas

160
Outline

• Motivation
• Automated vehicle technology
• Activity-travel behavior considerations
• Infrastructure planning & modeling implications
• Conclusions

161
The Context
• Automated Vehicles: vehicles that are able to guide themselves from an origin point to a destination point desired by the individual
• The individual yields near-full or partial control to artificial intelligence technology
• The individual decides an activity-travel plan (or tour-specific information)
• The plan is keyed into the car's intelligence system
• The car (or an external entity connected to the car) decides on a routing and circuit to complete the plan
162
Motivation

163
McKinsey: Autonomous Cars One of 12 Major Technology Disruptors

Source: Disruptive Technologies:


Advances that will transform life,
Business, and the global economy

McKinsey Global Institute


May 2013

164
Automated Vehicles and Transportation

Technology | Traveler Behavior | Infrastructure

165
Automated Vehicle
Technology

166
Two Types of Technology

Self-Driving Vehicle (e.g., Google):
• AI located within the vehicle
• “Outward-facing”: sensors blast outward from the vehicle to collect information without receiving data inward from other sources
• AI used to make autonomous decisions on what is best for the individual driver
• AI not shared with other entities beyond the vehicle
• A more “Capitalistic” set-up

Connected Vehicle:
• AI wirelessly connected to an external communications network
• “Inward-facing”: the vehicle receives external environment information through wireless connectivity, and operational commands from an external entity
• Used in cooperation with other pieces of information to make decisions on what is “best” from a system-optimal standpoint
• AI shared across multiple vehicles
• A more “Socialistic” set-up

167
Autonomous (Self-driving) Vehicle

• Google cars driven 500,000 miles – Release Date Expected 2018

168
Autonomous (Self-driving) Vehicle

169
Connected Vehicle Research

• Addresses suite of
technology and applications
using wireless
communications to provide
connectivity
• Among vehicle types
• Variety of roadway
infrastructure

170
Connected Vehicle Research

171
A “Connected” Vehicle
Data Sent from the Vehicle: real-time location, speed, acceleration, emissions, fuel consumption, and vehicle diagnostics data.

Data Provided to the Vehicle: real-time traffic information, safety messages, traffic signal messages, eco-speed limits, eco-routes, parking information, etc.

Improved Powertrain: more fuel-efficient powertrains, including hybrids, electric vehicles, and other alternative power sources.
172
Levels of Vehicle Automation
• Level 0: No automation
• Level 1: Function-specific Automation: automation of specific control functions, e.g., cruise control
• Level 2: Combined Function Automation: automation of multiple and integrated control functions, e.g., adaptive cruise control with lane centering
• Level 3: Limited Self-Driving Automation: drivers can cede safety-critical functions
• Level 4: Full Self-Driving Automation: vehicles perform all driving functions
173
Government Recognition

• Several US states have passed legislative initiatives


• National Highway Traffic and Safety Administration Policy
• Autopilot Systems Council in Japan
• Citymobil2 initiative in Europe

174
Infrastructure Needs/Planning Driven By…

• Complex activity-travel patterns


• Growth in long distance travel demand
• Limited availability of land to dedicate to infrastructure
• Budget/fiscal constraints
• Energy and environmental concerns
• Information/ communication technologies (ICT) and mobile
platform advances

Autonomous vehicles leverage technology to increase flow


without the need to expand capacity

175
Smarter Infrastructure

176
Technology and Infrastructure Combination Leads To…

• Safety enhancement
• Virtual elimination of driver error – factor in 80% of crashes
• Enhanced vehicle control, positioning, spacing, speed,
harmonization
• No drowsy, impaired, stressed, or aggressive drivers
• Reduced incidents and network disruptions
• Offsetting behavior on part of driver

177
Technology and Infrastructure Combination Leads To…

• Capacity enhancement
• Platooning reduces headways and improves flow at transitions
• Vehicle positioning (lateral control) allows reduced lane widths and
utilization of shoulders; accurate mapping critical
• Optimized route choice

• Energy and environmental benefits


• Increased fuel efficiency and reduced pollutant emissions
• Clean fuel vehicles
• Car-sharing

178
RoboCup (1997~)

179
6
Home Robots

PR2 Fetches a Beverage (Willow Garage) | PR2 Making Popcorn (TU Munich)

Dash at Hotel (Savioke) | SpotMini (Boston Dynamics)

180
7
Human-Like Robots

Humanoid Robot Nao (Aldebaran) Emotion Robot Pepper (SoftBank)

Life-Like Robots (Hanson Robotics) Atlas (Google Boston Dynamics)

181
8
Robot Life in a City

https://www.youtube.com/watch?v=gPzC88HkgcU&t=80s

Obelix (University of Freiburg, Germany)


182
9
AI Robots for the 4th Industrial Revolution

“Cognitive”
Smart
Machines

Body
(HW, Device)

183
Mind (SW, Data) 10
Enabling Technologies for AI Robots
• Perception
▪ Object recognition
▪ Person tracking
• Control
▪ Manipulation
▪ Action control
• Navigation
▪ Obstacle avoidance
▪ Map building & localization
• Interaction
▪ Vision and voice
▪ Multimodal interaction
• Computing Power
▪ Cloud computing
▪ GPUs, parallel computing
▪ Neural processors
184
11
2. Deep Learning for AI Robots

185
Traditional Machine Learning
vs. Deep Learning

186
13
Deep Learning Revolution

• Big Data + Parallel Computing + Deep Learning


• From programming to learning
• Automated- or self-programming
• Paradigm shift in S/W
• Self-improving systems
• Intelligence explosion

187
Power of Deep Learning

• Multiple boundaries are needed (e.g., the XOR problem) → multiple units
• More complex regions are needed (e.g., polygons) → multiple layers

Big Data + Deep Learning => Automatic Programming
188
AI / Deep Learning Growth

AlphaGo
2016

189
16
Deep Learning for Voice and Dialogue
• Speech LSTM-RNN (Recurrent Neural Networks)
• End-to-End Memory Networks (N2N MemNet)
• CNN + RNN for Question Answering

Sukhbaatar, Sainbayar, Jason Weston, and Rob Fergus. "End-to-end memory networks." Advances in Neural Information Processing Systems. 2015.
Gao, Haoyuan, et al. "Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question." Advances in Neural Information Processing Systems. 2015.
190
17
Interaction: Conversational Interface

Amazon Echo | Google Home | SKT Nugu

191
18
Deep Learning for Robotic Grasping

(Levine et al, 2016)

192
(C) 2015-2016, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 19
Deep Reinforcement Learning
for Action Control

BRETT (Univ. of California, Berkeley)

193
20
Deep Learning for Perception
• ImageNet Large-Scale Visual Recognition Challenge
▪ Image Classification/Localization
▪ 1.2M labeled images, 1000 classes
▪ Convolutional Neural Networks (CNNs) has been dominating the
contest since..
• 2012 non-CNN: 26.2% (top-5 error)
• 2012: (Hinton, AlexNet)15.3% (Using GPUs)
• 2013: (Clarifai) 11.2%
• 2014: (Google, GoogLeNet) 6.7%
• (pre-2015): (Google) 4.9%
– Beyond human-level performance

194
21
Deep Learning for Video Analysis
• Use 3D CNNs to model the temporal patterns as well as
the spatial patterns
3D Convolutional Neural Networks for Human Action Recognition (S. Ji, K. Yu, et al., PAMI, 2013): multiple 3D convolutions are applied to contiguous frames to extract multiple features; the architecture consists of 1 hardwired layer, 3 convolution layers, 2 subsampling layers, and 1 full connection layer.

A. Karpathy, L. Fei-Fei, et al., CVPR 2014
195
Deep Learning for Autonomous Driving
(NVIDIA)

196
23
3. New AI

197
Human Intelligence and Artificial Intelligence

198
Autonomous Machine Learning
1G: Supervised Learning (1980s~2000): Decision Trees, Kernel Methods, Multilayer Perceptrons
2G: Unsupervised Learning (2000~Present): Deep Networks, Markov Networks, Bayesian Networks
3G: Autonomous Learning (Next Generation): Complex Adaptive Systems, Perception-Action Cycle, Lifelong Continual Learning

199
ⓒ 2005-2015 SNU Biointelligence Laboratory, http://bi.snu.ac.kr/ 35
Future of AI
Technology progression (modified from Eliezer Yudkowsky & David Wood):
• Reactive, Sequential systems: follow given goals and methods
• Narrow AI, AI with Deep Learning: work out their own methods, follow given goals
• Cognitive AI, Human-Level AI (embodied, brain-like), Autonomous Agency: work out their own goals
• Superhuman AI, Free Will
Enabled by parallel computing; timeline roughly 1980, 1990, 2010, 2020, 2030, 2050.
200
Google Self Driving Car

201
Google Self Driving Car

CONTENTS
• Introduction
• Technology
• What Is It?
• How Does It Work?
• Equipment Used
202
Google Self Driving Car

• Advantages
• Limitations
• References

203
Google Self Driving Car

Introduction
• The Google self-driving car is a project by Google that involves developing technology for mainly electric cars.
• The software installed in Google's cars is called Google Chauffeur.
• The project was formerly led by Sebastian Thrun, former director of the Stanford Artificial Intelligence Laboratory and co-inventor of Google Street View.
• Google plans to make these cars available to the public in 2020.

204
Google Self Driving Car

Technology
• The project team has equipped a number of different types of cars with the self-driving equipment, including the Toyota, Audi TT, and Lexus RX450h; Google has also developed their own custom vehicle, which is assembled by Roush Enterprises and uses equipment from Bosch and LG.
• Google's robotic cars have about $150,000 in equipment, including a $70,000 LIDAR system.

205
Google Self Driving Car
• Laser allows the vehicle to generate a detailed 3D map of its environment.
• The car then takes these generated maps and combines them with high-resolution maps of the world.
• As of June 2014, the system works with a very high definition inch-precision map of the area the vehicle is expected to use.

206
Google Self Driving Car

What is it?
• It is the first truly driverless electric car prototype built by Google to test the next stage of its five-year-old self-driving car project.
• It looks like a cross between a Smart car and a Nissan Micra, with two seats and room enough for a small amount of luggage.
• It is the first real physical incarnation of Google's vision of what a self-driving car of the near future could be.

207
Google Self Driving Car

 How does it work?


 Powered by an electric motor with around a 100-mile
range, the car uses a combination of sensors and
software to locate itself in the real world, combined with
highly accurate digital maps.
 A GPS is used, just like the satellite navigation systems
in most cars, to get a rough location of the car, at
which point radar, lasers and cameras take over to
monitor the world around the car through 360 degrees.

208
Google Self Driving Car
 The software can recognise objects, people, cars, road
markings, signs and traffic lights, obeying the rules of
the road and allowing for multiple unpredictable
hazards, including cyclists. It can even detect roadworks
and safely navigate around them.

209
Google Self Driving Car

 Equipment Used
 Lidar System
 Video Cameras
 Radar Sensors
 Ultrasonic Sensors
 Central Computer

210
Google Self Driving Car
 Lidar

• The Lidar sensor is designed for
obstacle detection and navigation of
autonomous ground vehicles.

 Video Cameras

 Different types
of cameras are
installed at various
locations.

211
Google Self Driving Car

 Radar Sensors

• The radars are installed at the front and
back of the car.

 Ultrasonic Sensors

• They are used to measure the position of objects
very close to the vehicle, such as other
vehicles when parking.

212
Google Self Driving Car
 Central Computer

• Information from all the sensors is analysed by a
central computer.
• It manipulates the steering, accelerator and
brakes.

213
Google Self Driving Car

214
Google Self Driving Car

 Advantages
 Managing traffic flow.
 Relieving vehicle occupants of driving tasks.
 Avoiding accidents.
 Increasing roadway capacity.
 Determining the current location.

215
Google Self Driving Car

 Limitations
 Vehicles can be switched off on the road (in rare cases).
 Less security when using the Internet.
 Hackers can change routes (in rare cases).
 In case of sensor failure, the vehicle can create a
risk of accidents.
 In heavy rainfall, the car may not recognise traffic signals.

216
Google Self Driving Car

 References

 https://en.m.wikipedia.org/wiki/Google_self-driving_car
 Google self driving car project

217
Self Driving Robot
Wouldn’t it be cool to build your very own self-driving car using some of the same
techniques the big guys use? In this and the next few articles, I will guide you through how
to build your own physical, deep-learning, self-driving robotic car from scratch. You
will be able to make your car detect and follow lanes, and recognize and respond to traffic
signs and people on the road, in under a week.

218
Self Driving Robot

Lane Following / Traffic Sign and People Detection

SunFounder PiCar-V Robotic Car Kit


219
Raspberry Pi 3 B+
Self Driving Robot
The main software tools we use are Python (the de-facto programming language for
Machine Learning/AI tasks), OpenCV (a powerful computer vision package)
and TensorFlow (Google's popular deep learning framework). Note that all the software
we use here is FREE and open source!
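To give a flavour of the OpenCV side, here is a minimal, classical lane-detection sketch (not the article's exact code): the HSV colour range assumes blue-ish lane tape and the Hough parameters are illustrative only.

import cv2
import numpy as np

def detect_lane_lines(frame):
    # Classical pipeline: colour mask -> edge map -> Hough line segments
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (60, 40, 40), (150, 255, 255))  # assumed blue-ish lane tape
    edges = cv2.Canny(mask, 200, 400)
    h = edges.shape[0]
    edges[: h // 2, :] = 0                                   # keep only the lower half (the road)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=10,
                            minLineLength=8, maxLineGap=4)
    return [] if lines is None else [l[0] for l in lines]    # each entry is (x1, y1, x2, y2)

# Example: frame = cv2.imread("road.jpg"); print(detect_lane_lines(frame))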

A deep Convolutional Neural Network is used to detect road features and make the correct
steering decisions.

https://towardsdatascience.com/deeppicar-part-1-102e03c83f2c
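As a rough sketch of the kind of network such a project uses for steering (loosely inspired by the NVIDIA end-to-end model; the input size, layer widths and training call here are assumptions, not the article's exact code):

import tensorflow as tf

def build_steering_model(input_shape=(66, 200, 3)):
    # Camera frame in, single steering-angle value out (regression)
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Lambda(lambda x: x / 127.5 - 1.0),  # normalize pixels to [-1, 1]
        tf.keras.layers.Conv2D(24, 5, strides=2, activation="elu"),
        tf.keras.layers.Conv2D(36, 5, strides=2, activation="elu"),
        tf.keras.layers.Conv2D(48, 5, strides=2, activation="elu"),
        tf.keras.layers.Conv2D(64, 3, activation="elu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(100, activation="elu"),
        tf.keras.layers.Dense(50, activation="elu"),
        tf.keras.layers.Dense(1),                            # predicted steering angle
    ])
    model.compile(optimizer="adam", loss="mse")              # regress on recorded angles
    return model

# model = build_steering_model()
# model.fit(frames, angles, epochs=10)   # frames and steering angles recorded while driving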
220
CNN-Based Vision Model for Obstacle Avoidance of Mobile Robot

https://www.researchgate.net/publication/321535021_CNN-Based_Vision_Model_for_Obstacle_Avoidance_of_Mobile_Robot
221
CNN-Based Vision Model for Obstacle Avoidance of Mobile Robot
Vision model of obstacle avoidance

A mobile robot obstacle-avoidance problem in an indoor environment is addressed with a
Convolutional Neural Network (CNN) that takes the raw camera image as its only input.
The method converts raw pixels directly into steering commands: turn left, turn right or go
straight. Training data were collected by remotely controlling the mobile robot to explore a
structured environment without colliding with obstacles. The neural network was trained
under the Caffe framework, and the resulting commands are executed through the Robot
Operating System (ROS).
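The paper trains its network under Caffe; purely to illustrate the idea of mapping a raw image to one of three steering classes, here is a minimal AlexNet-flavoured sketch in Keras (filter counts, input size and class order are assumptions, not the authors' configuration):

import tensorflow as tf

NUM_CLASSES = 3   # assumed label order: turn left, go straight, turn right

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(227, 227, 3)),
    tf.keras.layers.Conv2D(96, 11, strides=4, activation="relu"),
    tf.keras.layers.MaxPooling2D(3, strides=2),
    tf.keras.layers.Conv2D(256, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(3, strides=2),
    tf.keras.layers.Conv2D(384, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(3, strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # labels are 0/1/2 command indices
              metrics=["accuracy"])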

Structure of the CNN based on the AlexNet network

222
CNN-Based Vision Model for Obstacle Avoidance of Mobile Robot
Vision model of obstacle avoidance

A package named ros_caffe provides a bridge between Caffe and ROS. The raw image is
acquired by ROS from the robot, a trained CNN model classifies the image to decide which
direction the robot should go next, and the corresponding instructions are
executed by ROS.
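A minimal rospy sketch of that loop is shown below; the topic names, the forward speed, and the placeholder classify() function are assumptions for illustration (the real system runs the image through the Caffe model via the ros_caffe bridge):

#!/usr/bin/env python
# Minimal sketch: camera image -> CNN steering class -> velocity command (assumed topic names).
import rospy
from sensor_msgs.msg import Image
from geometry_msgs.msg import Twist
from cv_bridge import CvBridge

bridge = CvBridge()

def classify(frame):
    # Placeholder for the trained CNN forward pass; returns 'left', 'right' or 'straight'.
    return "straight"

def on_image(msg):
    frame = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
    label = classify(frame)
    cmd = Twist()
    cmd.linear.x = 0.2                                      # keep rolling forward slowly
    cmd.angular.z = {"left": 0.5, "right": -0.5, "straight": 0.0}[label]
    cmd_pub.publish(cmd)

if __name__ == "__main__":
    rospy.init_node("cnn_obstacle_avoidance")
    cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
    rospy.Subscriber("/camera/image_raw", Image, on_image)
    rospy.spin()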

Flow chart of the running process

223
Test environment
CNN-Based Vision Model for Obstacle Avoidance of Mobile Robot

Datasets and training results

(a) Training curve of dataset 1 (test accuracy: 81.72%); (b) Training curve of dataset 2 (test accuracy: 93.21%)

224
CNN-Based Vision Model for Obstacle Avoidance of Mobile Robot

Experimental results

Left

Right

Straight (Go)

225
CNN-Based Vision Model for Obstacle Avoidance of Mobile Robot

CNN-Based Vision Model for Obstacle Avoidance of Robotino using 2 obstacles

226
ML Vision Model for Obstacle Avoidance of a Car-Like Robot

227
Localization Helps Self-Driving Cars Find Their Way - NVIDIA DRIVE Labs

228
Torch: http://bit.ly/1KzuFhd
• Created/Used by NYU, Facebook, Google DeepMind. De rigueur for deep learning
research
• Its language is Lua, NOT Python. Lua's syntax is somewhat Pythonic.
• Torch's main strength is its rich feature set, which is why I mention it even though here we
are at PyData.

Caffe: http://bit.ly/1Db2bHT
• Created/Used by Berkeley, Google
• Best tool to get started: many pre-trained reference models, and standard deep
learning datasets
• Easy to configure networks with config files

Theano: http://bit.ly/1KBsMAv
• Created/Used by University of Montreal
• Very flexible, very sophisticated: Lower level interface allows for lots of
customization, with many libraries being built ON TOP of Theano, e.g.: Keras,
PyLearn2, Lasagne, etc.
• Pythonic API, and very well documented.
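To illustrate Theano's symbolic, Pythonic style, here is a minimal example adapted from its standard logistic-function tutorial (shown only as a flavour of the API, not part of this course's code):

import theano
import theano.tensor as T

# Build a symbolic expression graph, then compile it into a callable function
x = T.dmatrix('x')
y = 1 / (1 + T.exp(-x))              # element-wise logistic sigmoid
logistic = theano.function([x], y)

print(logistic([[0, 1], [-1, -2]]))  # evaluates the compiled graph on concrete data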

229
Today: Convolutional Neural Networks (CNNs)
1. Scene understanding and object recognition for machines (and humans)
– Scene/object recognition challenge. Illusions reveal primitives, conflicting info
– Human neurons/circuits. Visual cortex layers==abstraction. General cognition
2. Classical machine vision foundations: features, scenes, filters, convolution
– Spatial structure primitives: edge detectors & other filters, feature recognition
– Convolution: basics, padding, stride, object recognition, architectures
3. CNN foundations: LeNet, de novo feature learning, parameter sharing
– Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning
– CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected
4. Modern CNN architectures: millions of parameters, dozens of layers
– Feature invariance is hard: apply perturbations, learn for each variation
– ImageNet progression of best performers
– AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU
– VGGNet: simpler but deeper (8-19 layers), 140M parameters, ensembles
– GoogleNet: new primitive=inception module, 5M params, no FC, efficiency
– ResNet: 152 layers, vanishing gradients → fit residuals to enable learning
5. Countless applications: General architecture, enormous power
– Semantic segmentation, facial detection/recognition, self-driving, image
colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
230
 References and resources
 Convolutional Neural Networks Prof. Manolis Kellis MIT-2020
 http://mit6874.github.io
Resources

 https://en.m.wikipedia.org/wiki/Google_self-driving_car
 Google self driving car project
UFLDL Tutorial
http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial

Nando de Freitas - Deep Learning at Oxford


https://www.youtube.com/watch?v=dV80NAlEins&list=PLE6Wd9FR--EfW8dtjAuPoTuPcqmOV53Fu
Conferences: http://videolectures.net/jul09_hinton_deeplearn/
http://videolectures.net/icml09_bengio_lecun_tldar/

Hinton demo
http://www.cs.toronto.edu/~hinton/adi/index.htm 231
