DL Notes
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Welcome
deeplearning.ai
Andrew Ng
• AI is the new Electricity
• Electricity had once transformed
countless industries: transportation,
manufacturing, healthcare,
communications, and more
Andrew Ng
What you’ll learn
Courses in this sequence (Specialization):
1. Neural Networks and Deep Learning
2. Improving Deep Neural Networks: Hyperparameter
tuning, Regularization and Optimization
3. Structuring your Machine Learning project
4. Convolutional Neural Networks
5. Natural Language Processing: Building sequence models
Andrew Ng
Introduction to
Deep Learning
What is a
deeplearning.ai
Neural Network?
Housing Price Prediction
price
size of house
Housing Price Prediction
Housing Price Prediction
size !"
#bedrooms !#
y
zip code !$
wealth !%
Introduction to
Deep Learning
Supervised Learning
deeplearning.ai with Neural Networks
Supervised Learning
Input(x) Output (y) Application
Home features Price Real Estate
Ad, user info Click on ad? (0/1) Online Advertising
Why is Deep
deeplearning.ai Learning taking off?
Andrew Ng
Scale drives deep learning progress
Performance
Amount of data
Andrew Ng
Scale drives deep learning progress
Idea
• Data
• Computation
• Algorithms
Experiment Code
Andrew Ng
Introduction to
Neural Networks
Andrew Ng
Courses in this Specialization
1. Neural Networks and Deep Learning
2. Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
3. Structuring your Machine Learning project
4. Convolutional Neural Networks
5. Natural Language Processing: Building sequence models
Andrew Ng
Outline of this Course
Week 1: Introduction
Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Binary Classification
deeplearning.ai
Binary Classification
[Figure: the Red, Green and Blue channel matrices of pixel-intensity values for the input image; these three matrices are unrolled into the input feature vector x.]
Andrew Ng
Notation
Andrew Ng
Basics of Neural
Network Programming
Logistic Regression
deeplearning.ai
Logistic Regression
Andrew Ng
Basics of Neural
Network Programming
Logistic Regression
deeplearning.ai
cost function
Logistic Regression cost function
! "
!" = % & ' + ) , where % * = "#$ +,
Given (' (.) , ! (.) ),…,(' (1) , ! (1) ) , want !" (2) ≈ ! 2 .
Loss (error) function:
Andrew Ng
Basics of Neural
Network Programming
Gradient Descent
deeplearning.ai
Gradient Descent
Recap: ŷ = σ(wᵀx + b), where σ(z) = 1 / (1 + e⁻ᶻ)
J(w, b) = (1/m) Σᵢ₌₁ᵐ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾) = −(1/m) Σᵢ₌₁ᵐ [y⁽ⁱ⁾ log ŷ⁽ⁱ⁾ + (1 − y⁽ⁱ⁾) log(1 − ŷ⁽ⁱ⁾)]
Want to find w, b that minimize J(w, b).
Andrew Ng
Gradient Descent
Andrew Ng
Basics of Neural
Network Programming
Derivatives
deeplearning.ai
Intuition about derivatives
! " = 3"
"
Andrew Ng
Basics of Neural
Network Programming
More derivatives
deeplearning.ai
examples
Intuition about derivatives
! " = "$
"
Andrew Ng
More derivative examples
Andrew Ng
Basics of Neural
Network Programming
Computation Graph
deeplearning.ai
Computation Graph
Andrew Ng
Basics of Neural
Network Programming
Derivatives with a
deeplearning.ai Computation Graph
Computing derivatives
a = 5, b = 3, c = 2
u = bc = 6
v = a + u = 11
J = 3v = 33
Andrew Ng
Computing derivatives
a = 5, b = 3, c = 2
u = bc = 6
v = a + u = 11
J = 3v = 33
Andrew Ng
Basics of Neural
Network Programming
Logistic Regression
deeplearning.ai
Gradient descent
Logistic regression recap
z = wᵀx + b
ŷ = a = σ(z)
L(a, y) = −(y log(a) + (1 − y) log(1 − a))
Andrew Ng
Logistic regression derivatives
x₁, w₁, x₂, w₂, b  →  z = w₁x₁ + w₂x₂ + b  →  a = σ(z)  →  L(a, y)
Andrew Ng
Basics of Neural
Network Programming
Gradient descent
deeplearning.ai
on m examples
Logistic regression on m examples
Andrew Ng
Logistic regression on m examples
Andrew Ng
Basics of Neural
Network Programming
Vectorization
deeplearning.ai
What is vectorization?
Andrew Ng
Basics of Neural
Network Programming
More vectorization
deeplearning.ai
examples
Neural network programming guideline
Whenever possible, avoid explicit for-loops.
Andrew Ng
Vectors and matrix valued functions
Say you need to apply the exponential operation on every element of a
matrix/vector.
v = [v₁, …, vₙ]ᵀ
import math
import numpy as np
u = np.zeros((n, 1))
for i in range(n):
    u[i] = math.exp(v[i])
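The same operation can be written without the explicit for-loop; a minimal vectorized sketch (assuming v is a NumPy array):
u = np.exp(v)   # element-wise exponential, no explicit for-loop
Similar built-ins (np.log, np.abs, np.maximum(v, 0)) vectorize other element-wise operations.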
Andrew Ng
Logistic regression derivatives
J = 0; dw₁ = 0; dw₂ = 0; db = 0
for i = 1 to m:
    z⁽ⁱ⁾ = wᵀx⁽ⁱ⁾ + b
    a⁽ⁱ⁾ = σ(z⁽ⁱ⁾)
    J += −[y⁽ⁱ⁾ log a⁽ⁱ⁾ + (1 − y⁽ⁱ⁾) log(1 − a⁽ⁱ⁾)]
    dz⁽ⁱ⁾ = a⁽ⁱ⁾ − y⁽ⁱ⁾
    dw₁ += x₁⁽ⁱ⁾ dz⁽ⁱ⁾
    dw₂ += x₂⁽ⁱ⁾ dz⁽ⁱ⁾
    db += dz⁽ⁱ⁾
J = J/m; dw₁ = dw₁/m; dw₂ = dw₂/m; db = db/m
Andrew Ng
Basics of Neural
Network Programming
Vectorizing Logistic
deeplearning.ai
Regression
Vectorizing Logistic Regression
z⁽¹⁾ = wᵀx⁽¹⁾ + b    z⁽²⁾ = wᵀx⁽²⁾ + b    z⁽³⁾ = wᵀx⁽³⁾ + b
a⁽¹⁾ = σ(z⁽¹⁾)        a⁽²⁾ = σ(z⁽²⁾)        a⁽³⁾ = σ(z⁽³⁾)
Z = [z⁽¹⁾ z⁽²⁾ … z⁽ᵐ⁾] = wᵀX + b,   A = [a⁽¹⁾ a⁽²⁾ … a⁽ᵐ⁾] = σ(Z)
Andrew Ng
Basics of Neural
Network Programming
Vectorizing Logistic
deeplearning.ai Regression’s Gradient
Computation
Vectorizing Logistic Regression
Andrew Ng
Implementing Logistic Regression
J = 0; dw₁ = 0; dw₂ = 0; db = 0
for i = 1 to m:
    z⁽ⁱ⁾ = wᵀx⁽ⁱ⁾ + b
    a⁽ⁱ⁾ = σ(z⁽ⁱ⁾)
    J += −[y⁽ⁱ⁾ log a⁽ⁱ⁾ + (1 − y⁽ⁱ⁾) log(1 − a⁽ⁱ⁾)]
    dz⁽ⁱ⁾ = a⁽ⁱ⁾ − y⁽ⁱ⁾
    dw₁ += x₁⁽ⁱ⁾ dz⁽ⁱ⁾
    dw₂ += x₂⁽ⁱ⁾ dz⁽ⁱ⁾
    db += dz⁽ⁱ⁾
J = J/m; dw₁ = dw₁/m; dw₂ = dw₂/m
db = db/m
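A fully vectorized sketch of one gradient-descent step in NumPy (assuming X of shape (n_x, m), Y of shape (1, m), parameters w, b and a learning rate alpha are already defined; an illustration, not the assignment's exact code):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

m = X.shape[1]
Z = np.dot(w.T, X) + b                  # shape (1, m)
A = sigmoid(Z)                          # predictions for all m examples
cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
dZ = A - Y                              # shape (1, m)
dw = np.dot(X, dZ.T) / m                # shape (n_x, 1)
db = np.sum(dZ) / m
w = w - alpha * dw                      # gradient-descent update
b = b - alpha * db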
Andrew Ng
Basics of Neural
Network Programming
Broadcasting in
deeplearning.ai
Python
Broadcasting example
Calories from Carbs, Proteins, Fats in 100g of different foods:
Apples Beef Eggs Potatoes
Carb 56.0 0.0 4.4 68.0
Protein 1.2 104.0 52.0 8.0
Fat 1.8 135.0 99.0 0.9
cal = A.sum(axis = 0)
percentage = 100*A/(cal.reshape(1,4))
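A minimal runnable version of this example, with A holding the table above (rows: carbs, protein, fat; columns: apples, beef, eggs, potatoes):
import numpy as np

A = np.array([[56.0,   0.0,  4.4, 68.0],
              [ 1.2, 104.0, 52.0,  8.0],
              [ 1.8, 135.0, 99.0,  0.9]])

cal = A.sum(axis=0)                        # total calories per food, shape (4,)
percentage = 100 * A / cal.reshape(1, 4)   # broadcasting divides each column by its total
print(percentage)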
Broadcasting example
[1, 2, 3, 4]ᵀ + 100 = [101, 102, 103, 104]ᵀ
(the scalar 100 is broadcast across all four entries of the column vector)
Explanation of logistic
deeplearning.ai regression cost function
(Optional)
Logistic regression cost function
Andrew Ng
Logistic regression cost function
If y = 1:  p(y | x) = ŷ
If y = 0:  p(y | x) = 1 − ŷ
Andrew Ng
Cost on m examples
Andrew Ng
Standard notations for Deep Learning
This document has the purpose of discussing a new standard for deep learning mathematical notations.

1 Neural Networks Notations
General comments:
· superscript (i) will denote the i-th training example, while superscript [l] will denote the l-th layer
· Y ∈ ℝ^(n_y × m) is the label matrix
· y⁽ⁱ⁾ ∈ ℝ^(n_y) is the output label for the i-th example
· W[l] ∈ ℝ^(number of units in next layer × number of units in previous layer) is the weight matrix; superscript [l] indicates the layer
· b[l] ∈ ℝ^(number of units in next layer) is the bias vector in the l-th layer
· ŷ ∈ ℝ^(n_y) is the predicted output vector; it can also be denoted a[L], where L is the number of layers in the network
Figure 1: Comprehensive Network: representation commonly used for Neural Networks. For better aesthetics, we omitted the details on the parameters (w_ij[l] and b_i[l], etc.) that should appear on the edges.
Figure 2: Simplified Network: a simpler representation of a two-layer neural network; both are equivalent.
Binary Classification
An image is stored in the computer in three separate matrices corresponding to the Red, Green, and Blue color channels of the image. The three matrices have the same size as the image; for example, if the resolution of the cat image is 64 pixels × 64 pixels, the three matrices (RGB) are 64 × 64 each.
The value in a cell represents the pixel intensity, which will be used to create a feature vector of n dimensions. In pattern recognition and machine learning, a feature vector represents an image, and the classifier's job is to determine whether it contains a picture of a cat or not.
To create a feature vector 𝑥, the pixel intensity values are "unrolled" or "reshaped" for each color. The dimension of the input feature vector 𝑥 is 𝑛ₓ = 64 × 64 × 3 = 12288.
red
green
blue
Logistic Regression
Logistic regression is a learning algorithm used in a supervised learning problem when the output labels 𝑦 are all either zero or one. The goal of logistic regression is to minimize the error between its predictions and the training data.
(𝑤ᵀ𝑥 + 𝑏) is a linear function (like 𝑎𝑥 + 𝑏), but since we are looking for a probability, constrained to [0, 1], the sigmoid function is applied to it; the sigmoid is bounded between [0, 1] as shown in the graph above.
Recap:
𝑦̂⁽ⁱ⁾ = 𝜎(𝑤ᵀ𝑥⁽ⁱ⁾ + 𝑏), where 𝜎(𝑧⁽ⁱ⁾) = 1 / (1 + 𝑒^(−𝑧⁽ⁱ⁾)) and 𝑥⁽ⁱ⁾ is the i-th training example.
Given {(𝑥⁽¹⁾, 𝑦⁽¹⁾), ⋯, (𝑥⁽ᵐ⁾, 𝑦⁽ᵐ⁾)}, we want 𝑦̂⁽ⁱ⁾ ≈ 𝑦⁽ⁱ⁾.
Loss function: 𝐿(𝑦̂⁽ⁱ⁾, 𝑦⁽ⁱ⁾) = −(𝑦⁽ⁱ⁾ log(𝑦̂⁽ⁱ⁾) + (1 − 𝑦⁽ⁱ⁾) log(1 − 𝑦̂⁽ⁱ⁾))
• If 𝑦⁽ⁱ⁾ = 1: 𝐿(𝑦̂⁽ⁱ⁾, 𝑦⁽ⁱ⁾) = −log(𝑦̂⁽ⁱ⁾), so minimizing the loss pushes 𝑦̂⁽ⁱ⁾ close to 1
• If 𝑦⁽ⁱ⁾ = 0: 𝐿(𝑦̂⁽ⁱ⁾, 𝑦⁽ⁱ⁾) = −log(1 − 𝑦̂⁽ⁱ⁾), so minimizing the loss pushes 𝑦̂⁽ⁱ⁾ close to 0
Cost function
The cost function is the average of the loss function of the entire training set. We are going to find the
parameters 𝑤 𝑎𝑛𝑑 𝑏 that minimize the overall cost function.
𝐽(𝑤, 𝑏) = (1/𝑚) Σᵢ₌₁ᵐ 𝐿(𝑦̂⁽ⁱ⁾, 𝑦⁽ⁱ⁾) = −(1/𝑚) Σᵢ₌₁ᵐ [𝑦⁽ⁱ⁾ log(𝑦̂⁽ⁱ⁾) + (1 − 𝑦⁽ⁱ⁾) log(1 − 𝑦̂⁽ⁱ⁾)]
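A short NumPy sketch of this cost computation (assuming A holds the predictions 𝑦̂ and Y the labels, both of shape (1, m)):
import numpy as np

def cost(A, Y):
    m = Y.shape[1]
    # average cross-entropy loss over the m training examples
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m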
Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Neural Networks
deeplearning.ai
Overview
What is a Neural Network?
!"
!# %&
!$ x
b
!"
!# %&
!$ x
4 ["] ' ["] = 4 ["] ! + , ["] -["] = .(' ["] ) ' [#] = 4 [#] -["] + , [#] -[#] = .(' [#] ) ℒ(-[#] , %)
, ["] 4 [#]
, [#] Andrew Ng
One hidden layer
Neural Network
Neural Network
deeplearning.ai
Representation
Neural Network Representation
!"
!# %&
!$
Andrew Ng
One hidden layer
Neural Network
Computing a
deeplearning.ai Neural Network’s
Output
Neural Network Representation
!" !"
!# ) ! ! + + -(') , = %& !# %&
' ,
!$ !$
' = )!! + +
, = -(')
Andrew Ng
Neural Network Representation
$( $(
$) # ! $ + & +(!) ' = ./ $) ./
! '
$* $*
! = #!$ + & $(
' = +(!) $) ./
$*
Andrew Ng
Neural Network Representation
" ", ["] ["] "
'"" )" = +" ! + /" , '" = 3()" )
!" " ", ["] ["] "
'#" )# = +# ! + /# , '# = 3()# )
!# %& ["] ["]
'$
"
)$" = +$" , ! + /$ , ' $ = 3()$" )
!$ ["] ["]
'(" )(" = +(" , ! + /( , '( = 3()(" )
Andrew Ng
Neural Network Representation learning
(""
Given input x:
%"
(," " "
! =$ %+' "
%, "
./
(-
%- ( " = )(! "
)
(0"
! , =$ , (" +' ,
( , = )(! ,
)
Andrew Ng
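A small NumPy sketch of this forward pass for one example x of shape (n_x, 1), assuming W1, b1, W2, b2 have already been initialized with matching shapes (illustrative, not the assignment's code):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    z1 = np.dot(W1, x) + b1    # shape (n_hidden, 1)
    a1 = sigmoid(z1)
    z2 = np.dot(W2, a1) + b2   # shape (1, 1)
    a2 = sigmoid(z2)           # prediction ŷ
    return a2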
One hidden layer
Neural Network
Vectorizing across
deeplearning.ai
multiple examples
Vectorizing across multiple examples
'" =) " !++ "
!" ," = -(' " )
!# %&
'# =) # ," ++ #
!$
,# = -(' # )
Andrew Ng
Vectorizing across multiple examples
for i = 1 to m:
! " ($) = ' " ( ($) + * "
+ " ($) = ,(! " $ )
! - ($) = ' - + " ($) + * -
+ - ($) = ,(! - $
)
Andrew Ng
One hidden layer
Neural Network
Explanation
deeplearning.ai for vectorized
implementation
Justification for vectorized implementation
Andrew Ng
Recap of vectorizing across multiple examples
for i = 1 to m:
    z[1]⁽ⁱ⁾ = W[1]x⁽ⁱ⁾ + b[1]
    a[1]⁽ⁱ⁾ = σ(z[1]⁽ⁱ⁾)
    z[2]⁽ⁱ⁾ = W[2]a[1]⁽ⁱ⁾ + b[2]
    a[2]⁽ⁱ⁾ = σ(z[2]⁽ⁱ⁾)
Vectorized, with the examples stacked as columns X = [x⁽¹⁾ x⁽²⁾ … x⁽ᵐ⁾]:
Z[1] = W[1]X + b[1]
A[1] = σ(Z[1])
Z[2] = W[2]A[1] + b[2]
A[2] = σ(Z[2])
where A[1] = [a[1]⁽¹⁾ a[1]⁽²⁾ … a[1]⁽ᵐ⁾]
Andrew Ng
One hidden layer
Neural Network
Activation functions
deeplearning.ai
Activation functions
!"
!# %&
!$
Given x:
'" =) " !++ "
," = -(' " )
'# =) # ," ++ #
,# = -(' # ) Andrew Ng
Pros and cons of activation functions
[Plots of candidate activation functions a = g(z)]
sigmoid: a = 1 / (1 + e⁻ᶻ)
Andrew Ng
One hidden layer
Neural Network
Why do you
deeplearning.ai need non-linear
activation functions?
Activation function
%"
%. 01
%/
Given x:
" "
! =$ %+' "
( " = )["] (! " )
! . =$ . (" +' .
( . = )[.] (! . )
Andrew Ng
One hidden layer
Neural Network
Derivatives of
deeplearning.ai activation functions
Sigmoid activation function
a
g(z) = 1 / (1 + e⁻ᶻ)
z
Andrew Ng
Tanh activation function
a
g(z) = tanh(z)
Andrew Ng
ReLU and Leaky ReLU
a a
z z
ReLU Leaky ReLU
Andrew Ng
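A compact NumPy sketch of these activations and the derivatives used later in backpropagation (a minimal illustration, not the course's own code):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)            # g'(z) = a(1 - a)

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2    # g'(z) = 1 - a^2

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z > 0).astype(float)  # 1 where z > 0, else 0

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)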
One hidden layer
Neural Network
Andrew Ng
Formulas for computing derivatives
Andrew Ng
One hidden layer
Neural Network
Backpropagation
deeplearning.ai intuition (Optional)
Computing gradients
Logistic regression
%
# ! = #$% + ' ) = *(!) ℒ(), /)
'
Andrew Ng
Neural network gradients
z[1] = W[1]x + b[1]  →  a[1] = σ(z[1])  →  z[2] = W[2]a[1] + b[2]  →  a[2] = σ(z[2])  →  L(a[2], y)
Parameters: W[1], b[1], W[2], b[2]
Andrew Ng
Summary of gradient descent
Single example:                          Vectorized over m examples:
dz[2] = a[2] − y                         dZ[2] = A[2] − Y
dW[2] = dz[2] a[1]ᵀ                      dW[2] = (1/m) dZ[2] A[1]ᵀ
db[2] = dz[2]                            db[2] = (1/m) np.sum(dZ[2], axis=1, keepdims=True)
dz[1] = W[2]ᵀ dz[2] ∗ g[1]′(z[1])        dZ[1] = W[2]ᵀ dZ[2] ∗ g[1]′(Z[1])
dW[1] = dz[1] xᵀ                         dW[1] = (1/m) dZ[1] Xᵀ
db[1] = dz[1]                            db[1] = (1/m) np.sum(dZ[1], axis=1, keepdims=True)
Andrew Ng
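A minimal NumPy sketch of one full gradient-descent step for this one-hidden-layer network, assuming X has shape (n_x, m), Y has shape (1, m), a tanh hidden layer and a sigmoid output unit (an illustration of the formulas above, not the assignment's code):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_step(X, Y, W1, b1, W2, b2, lr=0.01):
    m = X.shape[1]
    # forward propagation
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)
    # backward propagation
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)   # tanh'(Z1) = 1 - A1^2
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    # gradient-descent update
    W1 = W1 - lr * dW1
    b1 = b1 - lr * db1
    W2 = W2 - lr * dW2
    b2 = b2 - lr * db2
    return W1, b1, W2, b2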
One hidden layer
Neural Network
Random Initialization
deeplearning.ai
What happens if you initialize weights to
zero?
[!]
"# !!
[$]
!! %&
[!]
"$ !$
Andrew Ng
Random initialization
[!]
"# !!
[$]
!! %&
[!]
"$ !$
Andrew Ng
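A short sketch of the random initialization discussed here, for example layer sizes n_x = 3 inputs, n_h = 4 hidden units and n_y = 1 output (the 0.01 factor keeps pre-activations small so tanh/sigmoid do not saturate):
import numpy as np

n_x, n_h, n_y = 3, 4, 1                 # illustrative layer sizes
W1 = np.random.randn(n_h, n_x) * 0.01   # small random values break the symmetry
b1 = np.zeros((n_h, 1))                 # biases can safely start at zero
W2 = np.random.randn(n_y, n_h) * 0.01
b2 = np.zeros((n_y, 1))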
Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Deep L-layer
deeplearning.ai Neural network
What is a deep neural network?
Andrew
Ng
Deep Neural
Networks
Forward Propagation
deeplearning.ai in a Deep Network
Forward propagation in a deep network
Andrew
Ng
Deep Neural
Networks
Getting your matrix
deeplearning.ai
dimensions right
Parameters ! ["] and " ["]
&!
$%
&"
Andrew Ng
Vectorized implementation
#!
!"
#"
Andrew Ng
Deep Neural
Networks
Why deep
deeplearning.ai
representations?
Intuition about deep representation
!"
Andrew Ng
Circuit theory and deep learning
Informally: There are functions you can compute with a
“small” L-layer deep neural network that shallower networks
require exponentially more hidden units to compute.
Andrew Ng
Deep Neural
Networks
Building blocks of
deeplearning.ai deep neural networks
Forward and backward functions
Andrew
Ng
Forward and backward functions
Andrew
Ng
Deep Neural
Networks
Parameters vs
deeplearning.ai
Hyperparameters
What are hyperparameters?
Parameters: W[1], b[1], W[2], b[2], W[3], b[3], …
Andrew Ng
Applied deep learning is a very
empirical process
Idea
cost !
Andrew Ng
Deep Neural
Networks
What does this
deeplearning.ai have to do with
the brain?
Forward and backward propagation
! ["] = # ["] $ + & ["] -! [%] = '[%] − +
1
'["] = ( " (! " ) -# [%] = -! % ' %
&
0
! [$] = # [$] '["] + & [$] 1
'[$] = ( $ (! $ ) -& [%] = 12. sum(d! % , 9:;< = 1, =>>2-;0< = ?@A>)
0 & %
-! [%'"] = -# % -! % (( (! %'" )
…
'[%] = ( % ! % = +,
…
& "
-! ["] = -# % -! $ (( (! " )
1 &
-# = -! " ' "
["]
0
1
-& = 12. sum(d! " , 9:;< = 1, =>>2-;0< = ?@A>)
["]
0
Andrew Ng
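A sketch of the forward pass for a general L-layer network, with parameters stored in a dictionary keyed by layer (an illustrative layout with ReLU hidden layers and a sigmoid output, not the assignment's exact API):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def forward_L_layers(X, params, L):
    # params holds "W1", "b1", ..., "WL", "bL"
    A = X
    for l in range(1, L + 1):
        Z = np.dot(params["W" + str(l)], A) + params["b" + str(l)]
        A = sigmoid(Z) if l == L else relu(Z)
    return A   # A[L] = Ŷ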
Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Train/dev/test
deeplearning.ai sets
Applied ML is a highly iterative process
# layers Idea
# hidden units
learning rates
activation functions
… Experiment Code
Andrew Ng
Train/dev/test sets
Andrew Ng
Mismatched train/test distribution
Andrew Ng
Setting up your
ML application
Bias/Variance
deeplearning.ai
Bias and Variance
Andrew Ng
Bias and Variance
Cat classification
Andrew Ng
High bias and high variance
!#
!"
Andrew Ng
Setting up your
ML application
Basic “recipe”
deeplearning.ai for machine learning
Basic recipe for machine learning
Andrew Ng
Regularizing your
neural network
Regularization
deeplearning.ai
Logistic regression
min '(), *)
$,&
Andrew Ng
Neural network
Andrew Ng
Regularizing your
neural network
Why regularization
deeplearning.ai reduces overfitting
How does regularization prevent overfitting?
!"
!# %&
!$
Andrew Ng
Regularizing your
neural network
Dropout
deeplearning.ai regularization
Dropout regularization
!" !"
!# !#
$% $%
!' !'
!& !&
Andrew Ng
Implementing dropout (“Inverted dropout”)
Andrew Ng
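A minimal sketch of inverted dropout applied to one layer's activations a3 with keep_prob = 0.8 (a3 is an assumed NumPy array; illustrative only):
import numpy as np

keep_prob = 0.8
d3 = np.random.rand(*a3.shape) < keep_prob   # boolean mask: keep each unit with probability 0.8
a3 = a3 * d3                                 # zero out the dropped units
a3 = a3 / keep_prob                          # "inverted" scaling keeps the expected value of a3 unchanged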
Making predictions at test time
Andrew Ng
Regularizing your
neural network
Understanding
deeplearning.ai dropout
Why does drop-out work?
Intuition: Can’t rely on any one feature, so have to
spread out weights.
!"
!# $%
!&
Andrew Ng
Regularizing your
neural network
Other regularization
deeplearning.ai methods
Data augmentation
4
Andrew Ng
Early stopping
# iterations
Andrew Ng
Setting up your
optimization problem
Normalizing inputs
deeplearning.ai
Normalizing training sets
!" !"
3
!"
!# !#
!# 5
Andrew Ng
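A quick sketch of normalizing the training inputs X of shape (n_x, m); the same mu and sigma must then be reused on the dev and test sets:
import numpy as np

mu = np.mean(X, axis=1, keepdims=True)      # per-feature mean
sigma = np.std(X, axis=1, keepdims=True)    # per-feature standard deviation
X_norm = (X - mu) / sigma                   # zero-mean, unit-variance features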
Why normalize inputs?
J(w, b) = (1/m) Σᵢ₌₁ᵐ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾)
Unnormalized: [elongated contours of J; gradient descent needs a small learning rate]
Normalized: [roughly symmetric contours of J; gradient descent can take larger steps]
Vanishing/exploding
deeplearning.ai
gradients
Vanishing/exploding gradients
!"
$%
!# =
Andrew Ng
Single neuron example
!"
!#
!$ &'
!% ( = *(,)
Andrew Ng
Setting up your
optimization problem
Numerical approximation
deeplearning.ai of gradients
Checking your derivative computation
Andrew Ng
Checking your derivative computation
!
Andrew Ng
Setting up your
optimization problem
Gradient Checking
deeplearning.ai
Gradient check for a neural network
Andrew Ng
Gradient checking (Grad check)
Andrew Ng
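A sketch of two-sided gradient checking on a flattened parameter vector theta, assuming J is a function computing the cost and grad holds the analytic gradient (names are illustrative):
import numpy as np

def grad_check(J, theta, grad, eps=1e-7):
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)   # two-sided difference
    # relative difference: around 1e-7 is great, around 1e-3 is worrying
    return np.linalg.norm(approx - grad) / (np.linalg.norm(approx) + np.linalg.norm(grad))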
Setting up your
optimization problem
Gradient Checking
deeplearning.ai implementation notes
Gradient checking implementation notes
- Remember regularization.
Mini-batch
deeplearning.ai
gradient descent
Batch vs. mini-batch gradient descent
Vectorization allows you to efficiently compute on m examples.
Andrew Ng
Mini-batch gradient descent
Andrew Ng
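A sketch of one epoch of mini-batch gradient descent with mini-batch size 1000, where forward_backward and update are assumed helper functions and X, Y, params, learning_rate are assumed to be defined:
batch_size = 1000
m = X.shape[1]
for t in range(0, m, batch_size):
    X_t = X[:, t:t + batch_size]                        # t-th mini-batch of inputs
    Y_t = Y[:, t:t + batch_size]                        # corresponding labels
    grads, cost = forward_backward(X_t, Y_t, params)    # assumed helper
    params = update(params, grads, learning_rate)       # assumed helper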
Optimization
Algorithms
Understanding
deeplearning.ai mini-batch
gradient descent
Training with mini batch gradient descent
cost
# iterations mini batch # (t)
Andrew Ng
Choosing your mini-batch size
Andrew Ng
Choosing your mini-batch size
Andrew Ng
Andrew Ng
Andrew Ng
Optimization
Algorithms
Understanding
deeplearning.ai exponentially
weighted averages
Exponentially weighted averages
!" = $!"%& + (1 − $),"
temperature
days
Andrew Ng
Exponentially weighted averages
vₜ = β vₜ₋₁ + (1 − β) θₜ
Andrew Ng
Implementing exponentially weighted
averages
!" = 0
!% = &!" + (1 − &) -%
!/ = &!% + (1 − &) -/
!0 = &!/ + (1 − &) -0
…
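As a running-loop sketch, where thetas is an assumed sequence of daily temperatures and beta = 0.9 corresponds to averaging over roughly the last 10 days:
v = 0.0
beta = 0.9
averages = []
for theta in thetas:                    # thetas: iterable of observations (assumed defined)
    v = beta * v + (1 - beta) * theta
    averages.append(v)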
Andrew Ng
Optimization
Algorithms
Bias correction
deeplearning.ai in exponentially
weighted average
Bias correction
temperature
days
!" = $!"%& + (1 − $),"
Andrew Ng
Optimization
Algorithms
Gradient descent
deeplearning.ai
with momentum
Gradient descent example
Andrew Ng
Implementation details
On iteration t:
    Compute dW, db on the current mini-batch
    v_dW = β v_dW + (1 − β) dW
    v_db = β v_db + (1 − β) db
    W = W − α v_dW,   b = b − α v_db
Hyperparameters: α, β;  β = 0.9 is a common default
Andrew Ng
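A sketch of the momentum update, assuming dW and db are the current mini-batch gradients and v_dW, v_db were initialized as zero arrays of matching shape:
beta = 0.9      # momentum coefficient
alpha = 0.01    # learning rate (illustrative value)
v_dW = beta * v_dW + (1 - beta) * dW
v_db = beta * v_db + (1 - beta) * db
W = W - alpha * v_dW
b = b - alpha * v_db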
Optimization
Algorithms
RMSprop
deeplearning.ai
RMSprop
Andrew Ng
Optimization
Algorithms
Adam optimization
deeplearning.ai
algorithm
Adam optimization algorithm
Andrew Ng
Hyperparameters choice:
Adam Coates
Andrew Ng
Optimization
Algorithms
Learning rate
deeplearning.ai
decay
Learning rate decay
Andrew Ng
Learning rate decay
Andrew Ng
Other learning rate decay methods
Andrew Ng
Optimization
Algorithms
The problem of
deeplearning.ai
local optima
Local optima in neural networks
Andrew Ng
Problem of plateaus
Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Tuning process
deeplearning.ai
Hyperparameters
Andrew Ng
Try random values: Don’t use a grid
Hyperparameter 2 Hyperparameter 2
Hyperparameter 1
Hyperparameter 1
Andrew Ng
Coarse to fine
Hyperparameter 2
Hyperparameter 1
Andrew Ng
Hyperparameter
tuning
Using an appropriate
deeplearning.ai scale to pick
hyperparameters
Picking hyperparameters at random
Andrew Ng
Appropriate scale for hyperparameters
Andrew Ng
Hyperparameters for exponentially
weighted averages
Andrew Ng
Hyperparameters
tuning
Hyperparameters
deeplearning.ai tuning in practice:
Pandas vs. Caviar
Re-test hyperparameters occasionally
Idea
- NLP, Vision, Speech,
Ads, logistics, ….
Andrew Ng
Babysitting one Training many
model models in parallel
Normalizing activations
deeplearning.ai in a network
Normalizing inputs to speed up learning
!" ', )
!# %&
!$
!"
!# %&
!$
Andrew Ng
Implementing Batch Norm
Andrew Ng
Batch
Normalization
Andrew Ng
Working with mini-batches
Andrew Ng
Implementing gradient descent
Andrew Ng
Batch
Normalization
Why does
deeplearning.ai
Batch Norm work?
Learning on shifting input distribution
!"
!# %&
!$
Cat Non-Cat
y = 1    y = 0    y = 1    y = 0
Andrew Ng
Why is this a problem for neural networks?
!"
!# %&
!$
Andrew Ng
Batch Norm as regularization
• Each mini-batch is scaled by the mean/variance computed
on just that mini-batch.
• This adds some noise to the values z[l] within that
minibatch. So similar to dropout, it adds some noise to each
hidden layer’s activations.
• This has a slight regularization effect.
Andrew Ng
Multi-class
classification
Softmax regression
deeplearning.ai
Recognizing cats, dogs, and baby chicks
3 1 2 0 3 2 0 1
X !"
Andrew Ng
Softmax layer
X !"
Andrew Ng
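A numerically stable softmax sketch for the output layer, taking logits z of shape (n_classes, m):
import numpy as np

def softmax(z):
    t = np.exp(z - np.max(z, axis=0, keepdims=True))   # subtract the max for numerical stability
    return t / np.sum(t, axis=0, keepdims=True)        # each column now sums to 1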
Softmax examples
#% #% #%
#$ #$ #$
#% #% #%
#$ #$ #$ Andrew Ng
Programming
Frameworks
Deep Learning
deeplearning.ai
frameworks
Deep learning frameworks
• Caffe/Caffe2 Choosing deep learning frameworks
• CNTK - Ease of programming (development
• DL4J and deployment)
• Keras - Running speed
• Lasagne - Truly open (open source with good
governance)
• mxnet
• PaddlePaddle
• TensorFlow
• Theano
• Torch
Andrew Ng
Programming
Frameworks
TensorFlow
deeplearning.ai
Motivating problem
Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Why ML
deeplearning.ai
Strategy?
Motivating example
Ideas:
• Collect more data • Try dropout
• Collect more diverse training set • Add !" regularization
• Train algorithm longer with gradient descent • Network architecture
• Try Adam instead of gradient descent • Activation functions
• Try bigger network • # hidden units
• Try smaller network • … Andrew Ng
Introduction to
ML strategy
Orthogonalization
deeplearning.ai
TV tuning example
Car
Andrew Ng
Chain of assumptions in ML
Andrew Ng
Setting up
your goal
Single number
deeplearning.ai
evaluation metric
Using a single number evaluation metric
Idea
Andrew Ng
Another example
Andrew Ng
Setting up
your goal
Satisficing and
deeplearning.ai
optimizing metrics
Another cat classification example
Classifier Accuracy Running time
A 90% 80ms
B 92% 95ms
C 95% 1,500ms
Andrew Ng
Setting up
your goal
Train/dev/test
deeplearning.ai
distributions
Cat classification dev/test sets
Regions:
• US
• UK
• Other Europe
• South America
• India
Idea
• China
• Other Asia
• Australia
Experiment Code
Andrew Ng
True story (details changed)
Andrew Ng
Guideline
Andrew Ng
Setting up
your goal
Size of dev
deeplearning.ai
and test sets
Old way of splitting data
Andrew Ng
Size of dev set
Set your dev set to be big enough to detect differences in
algorithm/models you’re trying out.
Andrew Ng
Size of test set
Set your test set to be big enough to give high confidence
in the overall performance of your system.
Andrew Ng
Setting up
your goal
When to change
deeplearning.ai dev/test sets and
metrics
Cat dataset examples
Andrew Ng
Orthogonalization for cat pictures: anti-porn
Andrew Ng
Another example
Algorithm A: 3% error
Algorithm B: 5% error
Dev/test User images
Andrew Ng
Comparing to human-
level performance
Why human-level
deeplearning.ai
performance?
Comparing to human-level performance
accuracy
time
Andrew Ng
Why compare to human-level performance
Humans are quite good at a lot of tasks. So long as
ML is worse than humans, you can:
- Get labeled data from humans.
Andrew Ng
Comparing to human-
level performance
Avoidable bias
deeplearning.ai
Bias and Variance
Andrew Ng
Bias and Variance
Cat classification
Andrew Ng
Cat classification example
Training error 8% 8%
Dev error 10% 10 %
Andrew Ng
Comparing to human-
level performance
Understanding
deeplearning.ai human-level
performance
Human-level error as a proxy for Bayes error
Medical image classification example:
Suppose:
(a) Typical human ………………. 3 % error
Training error
Dev error
Andrew Ng
Summary of bias/variance with human-level
performance
Human-level error
Training error
Dev error
Andrew Ng
Comparing to human-
level performance
Surpassing human-
deeplearning.ai
level performance
Surpassing human-level performance
Team of humans
One human
Training error
Dev error
Andrew Ng
Problems where ML significantly surpasses
human-level performance
- Online advertising
- Product recommendations
- Loan approvals
Andrew Ng
Comparing to human-
level performance
Andrew Ng
Reducing (avoidable) bias and variance
More data
Dev error Regularization
NN architecture/hyperparameters search
Andrew Ng
Orthogonalization
Orthogonalization, or orthogonality, is a system design property that ensures that modifying an instruction or a component of an algorithm does not create or propagate side effects to other components of the system. It becomes easier to verify the algorithms independently from one another, and it reduces testing and development time.
When a supervised learning system is designed, these are the four assumptions that need to be true and orthogonal.
Predicted class ŷ versus actual class y:
            y = 1             y = 0
ŷ = 1   True positive     False positive
ŷ = 0   False negative    True negative
Precision
Of all the images for which we predicted y = 1, what fraction actually contain cats?
Precision (%) = True positives / Number of predicted positives × 100 = True positives / (True positives + False positives) × 100
Recall
Of all the images that actually contain cats, what fraction did we correctly identify as containing cats?
Recall (%) = True positives / Number actually positive × 100 = True positives / (True positives + False negatives) × 100
Let's compare two classifiers, A and B, used to detect cat images:
Classifier A has 95% precision (when it predicts a cat, there is a 95% chance the image really contains one) and 90% recall (it correctly detects 90% of the cat images), whereas classifier B has 98% precision and 85% recall.
The problem with using precision and recall as the evaluation metric is that you cannot be sure which classifier is better, since in this case both have good precision and good recall. The F1 score, a harmonic mean, combines precision and recall into one number:
F1 = 2 / (1/p + 1/r)
Classifier A is the better choice. The F1 score is not the only evaluation metric that can be used; the average of precision and recall, for example, could also be an indicator of which classifier to use.
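As a quick check of that conclusion, the harmonic means work out as follows (simple arithmetic on the numbers above):
# F1 = 2 / (1/p + 1/r) = 2pr / (p + r)
f1_a = 2 * 0.95 * 0.90 / (0.95 + 0.90)   # ≈ 0.924 for classifier A
f1_b = 2 * 0.98 * 0.85 / (0.98 + 0.85)   # ≈ 0.910 for classifier B
So classifier A's F1 score (about 92.4%) edges out classifier B's (about 91.0%).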
Satisficing and optimizing metrics
There are different metrics for evaluating the performance of a classifier; they are called evaluation metrics. They can be categorized as satisficing and optimizing metrics. Note that these evaluation metrics must be measured on a training set, a development set, or a test set.
In this case, accuracy and running time are the evaluation metrics. Accuracy is the optimizing metric, because you want the classifier to detect cat images as accurately as possible. The running time, which has to stay under 100 ms in this example, is the satisficing metric: it only has to meet the expectation that was set.
More generally, with N metrics it is common to pick 1 as the optimizing metric and treat the remaining N − 1 as satisficing metrics.
Training, development and test distributions
Setting up the training, development and test sets have a huge impact on productivity. It is important to
choose the development and test sets from the same distribution and it must be taken randomly from all
the data.
Guideline
Choose a development set and test set to reflect data you expect to get in the future and consider
important to do well.
Size of the development and test sets
Old way of splitting data
We used to have smaller data sets, so a greater percentage of the data was needed to develop and test ideas and models:
• train / test: 70% / 30%
• train / dev / test: 60% / 20% / 20%
With today's much larger data sets, a split such as 98% / 1% / 1% is common.
Guidelines
• Set the size of the test set to give high confidence in the overall performance of the system.
• The test set helps evaluate the performance of the final classifier, and it could be less than 30% of the whole data set.
• The development set has to be big enough to evaluate different ideas.
When to change development/test sets and metrics
Example: Cat vs Non-cat
A cat classifier tries to find a great amount of cat images to show to cat loving users. The evaluation metric
used is a classification error.
It seems that Algorithm A is better than Algorithm B since it has only 3% error; however, for some reason, Algorithm A is letting through a lot of pornographic images.
Algorithm B has 5% error, so it misclassifies more images, but it does not let through pornographic images. From the company's point of view, as well as from a user-acceptance point of view, Algorithm B is actually the better algorithm. The evaluation metric fails to correctly rank-order preferences between algorithms; in that case, the evaluation metric, the development set, or the test set should be changed.
The problem with this evaluation metric is that it treats pornographic and non-pornographic images equally. One way to change the evaluation metric is to add a weight term 𝑤⁽ⁱ⁾ that penalizes misclassified pornographic images much more heavily.
Guideline
1. Define an evaluation metric that correctly rank-orders classifiers
2. Optimize for that evaluation metric
Why human-level performance?
Today, machine learning algorithms can compete with human-level performance, since they have become more capable and more practical in many applications. Also, the workflow of designing and building a machine learning system is much more efficient than before.
Moreover, some of the tasks that humans do are close to ''perfection'', which is why machine learning tries to mimic human-level performance.
The graph below shows the performance of humans and machine learning over time.
[Plot: performance vs. time; machine learning progress is rapid until it approaches human-level performance, then slowly asymptotes toward the Bayes optimal error.]
Machine learning progresses slowly once it surpasses human-level performance. One of the reasons is that human-level performance can be close to the Bayes optimal error, especially for natural perception problems.
The Bayes optimal error is defined as the best possible error; in other words, no function mapping from x to y can surpass that level of accuracy.
Also, while the performance of machine learning is worse than the performance of humans, you can improve it with different tools; those tools are harder to use once it surpasses human-level performance.
In this case, human-level error is used as a proxy for Bayes error, since humans are good at identifying images. You want the training-set performance to approach that proxy, but you cannot do better than the Bayes error without overfitting the training set. Knowing the Bayes error makes it easier to decide whether bias-reduction or variance-reduction tactics will improve the performance of the model.
Scenario A
There is a 7% gap between the performance of the training set and the human level error. It means that
the algorithm isn’t fitting well with the training set since the target is around 1%. To resolve the issue, we
use bias reduction technique such as training a bigger neural network or running the training set longer.
Scenario B
The training set is doing well, since there is only a 0.5% difference from human-level error. The difference between the training error and human-level error is called the avoidable bias. The focus here is on reducing the variance, since the difference between the training error and the development error is 2%. To resolve the issue, we use variance-reduction techniques such as regularization or a bigger training set.
Understanding human-level performance
Human-level error gives an estimate of Bayes error.
The definition of human-level error depends on the purpose of the analysis, in this case, by definition the
Bayes error is lower or equal to 0.5%.
Scenario B
In this case, the choice of human-level performance doesn’t have an impact. The avoidable bias is between
0%-0.5% and the variance is 4%. Therefore, the focus should be on variance reduction technique.
Scenario C
In this case, the estimate for Bayes error has to be 0.5% since you can’t go lower than the human-level
performance otherwise the training set is overfitting. Also, the avoidable bias is 0.2% and the variance is
0.1%. Therefore, the focus should be on bias reduction technique.
Scenario B
In this case, there is not enough information to know whether bias reduction or variance reduction should be applied to the algorithm. It doesn't mean that the model cannot be improved; it means that the conventional ways of deciding between bias reduction and variance reduction do not work in this case.
There are many problems where machine learning significantly surpasses human-level performance,
especially with structured data:
• Online advertising
• Product recommendations
• Logistics (predicting transit time)
• Loan approvals
Improving your model performance
There are 2 fundamental assumptions of supervised learning. The first one is to have a low avoidable bias
which means that the training set fits well. The second one is to have a low or acceptable variance which
means that the training set performance generalizes well to the development set and test set.
If the difference between human-level error and the training error is bigger than the difference between the training error and the development error, the focus should be on bias-reduction techniques: training a bigger model, training longer, changing the neural network architecture, or trying various hyperparameter searches.
If the difference between the training error and the development error is bigger than the difference between human-level error and the training error, the focus should be on variance-reduction techniques: getting a bigger data set, regularization, changing the neural network architecture, or trying various hyperparameter searches.
Summary
• More data
• Regularization
• Neural Networks architecture/hyperparameters search
Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Andrew Ng
Evaluate multiple ideas in parallel
Ideas for cat detection:
• Fix pictures of dogs being recognized as cats
• Fix great cats (lions, panthers, etc..) being misrecognized
• Improve performance on blurry images
Image
1
2
3
..
.
% of total
Andrew Ng
Error Analysis
Cleaning up
deeplearning.ai Incorrectly labeled
data
Incorrectly labeled examples
y 1 0 1 1 0 1 1
Andrew Ng
Error analysis
Incorrectly
Image Dog Great Cat Blurry Comments
labeled
…
Labeler missed cat
98 in background
99
Drawing of a cat;
100 Not a real cat.
Goal of dev set is to help you select between two classifiers A & B.
Andrew Ng
Correcting incorrect dev/test set examples
Andrew Ng
Speech recognition example
Training Dev/test
Voice keyboard
Andrew Ng
Mismatched training
and dev/test data
Training error
Dev error
Andrew Ng
More general formulation
Andrew Ng
Mismatched training
and dev/test data
Addressing data
deeplearning.ai
mismatch
Addressing data mismatch
• Carry out manual error analysis to try to understand difference
between training and dev/test sets
Andrew Ng
Artificial data synthesis
+ =
Andrew Ng
Artificial data synthesis
Car recognition:
Andrew Ng
Learning from
multiple tasks
Transfer learning
deeplearning.ai
Transfer learning
x !"
x !"
Andrew Ng
When transfer learning makes sense
Andrew Ng
Learning from
multiple tasks
Multi-task
deeplearning.ai
learning
Simplified autonomous driving example
Andrew Ng
Neural network architecture
x !"
Andrew Ng
When multi-task learning makes sense
• Training on a set of tasks that could benefit from having
shared lower-level features.
• Usually: Amount of data you have for each task is quite
similar.
Andrew Ng
End-to-end deep
learning
What is
deeplearning.ai end-to-end
deep learning
What is end-to-end learning?
Speech recognition example
Andrew Ng
Face recognition
Andrew Ng
More examples
Machine translation
Andrew Ng
End-to-end deep
learning
Whether to use
deeplearning.ai
end-to-end learning
Pros and cons of end-to-end deep learning
Pros:
• Let the data speak
• Less hand-designing of components needed
Cons:
• May need large amount of data
• Excludes potentially useful hand-designed
components
Andrew Ng
Applying end-to-end deep learning
Key question: Do you have sufficient data to learn
a function of the complexity needed to map x to y?
Andrew Ng
Build system quickly, then iterate
Depending on the area of application, the guideline below will help you prioritize when you build your
system.
Guideline
1. Set up development/ test set and metrics
- Set up a target
2. Build an initial system quickly
- Train training set quickly: Fit the parameters
- Development set: Tune the parameters
- Test set: Assess the performance
3. Use Bias/Variance analysis & Error analysis to prioritize next steps
Training and testing on different distributions
Example: Cat vs Non-cat
In this example, we want to create a mobile application that will classify and recognize pictures of cats
taken and uploaded by users.
There are two sources of data used to develop the mobile app. The first data distribution is small: 10,000 pictures uploaded from the mobile application. Since they come from amateur users, the pictures are not professionally shot, not well framed, and blurrier. The second source is the web, from which you downloaded 200,000 pictures whose cat pictures are professionally framed and in high resolution.
1- small data set from pictures uploaded by users. This distribution is important for the mobile app.
2- bigger data set from the web.
The guideline used is that you have to choose a development set and test set to reflect data you expect
to get in the future and consider important to do well.
The advantage of this way of splitting up is that the target is well defined.
The disadvantage is that the training distribution is different from the development and test set
distributions. However, this way of splitting the data has a better performance in long term.
Bias and variance with mismatched data distributions
Example: Cat classifier with mismatched data distributions
When the training set is from a different distribution than the development and test sets, the method to
analyze bias and variance changes.
Scenario A
If the development data comes from the same distribution as the training set, then there is a large
variance problem and the algorithm is not generalizing well from the training set.
However, since the training data and the development data come from a different distribution, this
conclusion cannot be drawn. There isn't necessarily a variance problem. The problem might be that the
development set contains images that are more difficult to classify accurately.
When the training set, development and test sets distributions are different, two things change at the
same time. First of all, the algorithm trained in the training set but not in the development set. Second of
all, the distribution of data in the development set is different.
It's difficult to know which of these two changes produces this 9% increase in error between the training set and the development set. To resolve this issue, we define a new subset called the training-development set. This new subset has the same distribution as the training set, but it is not used for training the neural network.
Scenario B
The error between the training set and the training-development set is 8%. In this case, since the training set and the training-development set come from the same distribution, the only difference between them is that the network was trained on the former and not on the latter. The neural network is not generalizing well to data from the same distribution that it hadn't seen before; therefore, we really do have a variance problem.
Scenario C
In this case, we have a data mismatch problem, since the two data sets come from different distributions.
Scenario D
In this case, the avoidable bias is high since the difference between Bayes error and training error is 10 %.
Scenario E
In this case, there are 2 problems. The first one is that the avoidable bias is high since the difference
between Bayes error and training error is 10 % and the second one is a data mismatched problem.
Scenario F
Development should never be done on the test set. However, the difference between the development
set and the test set gives the degree of overfitting to the development set.
General formulation
Bayes error
Avoidable Bias
Variance
Data mismatch
• Perform manual error analysis to understand the error differences between training,
development/test sets. Development should never be done on test set to avoid overfitting.
• Make the training data more similar to the development and test sets, or collect more data similar to them. One technique for making the training data more similar to your development set is artificial data synthesis; however, be aware that you might accidentally be simulating data from only a tiny subset of the space of all possible examples.
Transfer Learning
Transfer learning refers to using the neural network knowledge for another application.
Radiology diagnosis
Input 𝑥: Radiology images – CT Scan, X-rays
Output 𝑦 :Radiology diagnosis – 1: tumor malign, 0: tumor benign
Radiology images
𝑥
Radiology diagnosis
𝑦ො
Guideline
• Delete last layer of neural network
• Delete weights feeding into the last output layer of the neural network
• Create a new set of randomly initialized weights for the last layer only
• New data set (𝑥, 𝑦)
Multi-task learning
Multi-task learning refers to having one neural network do simultaneously several tasks.
Y = [y⁽¹⁾ y⁽²⁾ y⁽³⁾ ⋯ y⁽ᵐ⁾], where each label y⁽ⁱ⁾ is a (4, 1) column vector, so Y has shape (4, m).
Cars
Also, the cost can be computed in such a way that it is not influenced by the fact that some entries are not labeled.
Example:
Y = [ 1 0 ? ?
      0 1 ? 0
      0 1 ? 1
      ? 0 1 0 ]
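A sketch of a per-task loss that skips the unlabeled ("?") entries, assuming the missing labels are stored as np.nan in a NumPy array Y and A holds the sigmoid outputs of the same shape:
import numpy as np

def multitask_loss(A, Y):
    mask = ~np.isnan(Y)                   # True wherever a label exists
    Yv, Av = Y[mask], A[mask]             # sum only over the labeled entries
    return -np.mean(Yv * np.log(Av) + (1 - Yv) * np.log(1 - Av))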
What is end-to-end deep learning
End-to-end deep learning is the simplification of a processing or learning systems into one neural
network.
Audio Transcript
End-to-end deep learning cannot be used for every problem, since it needs a lot of labeled data. It is used mainly in audio transcription, image captioning, image synthesis, machine translation, steering in self-driving cars, etc.
Whether to use end-to-end deep learning
Before applying end-to-end deep learning, you need to ask yourself the following question: Do you have enough data to learn a function of the complexity needed to map x to y?
Pro:
• Let the data speak
- By having a pure machine learning approach, the neural network will learn from x to y. It will
be able to find which statistics are in the data, rather than being forced to reflect human
preconceptions.
Cons:
• Large amount of labeled data
- It cannot be used for every problem as it needs a lot of labeled data.
Computer vision
deeplearning.ai
Computer Vision Problems
Image Classification Neural Style Transfer
Cat? (0/1)
64x64
Object detection
Andrew Ng
Deep Learning on large images
Cat? (0/1)
64x64
!"
!#
⋮ ⋮ ⋮ %&
!$
Andrew Ng
Convolutional
Neural Networks
Edge detection
deeplearning.ai
example
Computer Vision Problem
vertical edges
Andrew Ng
Vertical edge detection
10 10 10 0 0 0
10 10 10 0 0 0 0 30 30 0
1 0 -1
10 10 10 0 0 0 0 30 30 0
∗ 1 0 -1 =
10 10 10 0 0 0 0 30 30 0
1 0 -1
10 10 10 0 0 0 0 30 30 0
10 10 10 0 0 0
∗
Andrew Ng
Convolutional
Neural Networks
More edge
deeplearning.ai
detection
Vertical edge detection examples
10 10 10 0 0 0
10 10 10 0 0 0 0 30 30 0
1 0 -1
10 10 10 0 0 0 0 30 30 0
10 10 10 0 0 0
∗ 1 0 -1 =
0 30 30 0
1 0 -1
10 10 10 0 0 0 0 30 30 0
10 10 10 0 0 0
0 0 0 10 10 10
0 0 0 10 10 10 0 -30 -30 0
1 0 -1
0 0 0 10 10 10 0 -30 -30 0
0 0 0 10 10 10
∗ 1 0 -1 =
0 -30 -30 0
1 0 -1
0 0 0 10 10 10 0 -30 -30 0
0 0 0 10 10 10
Andrew Ng
Vertical and Horizontal Edge Detection
1 0 -1 1 1 1
1 0 -1 0 0 0
1 0 -1 -1 -1 -1
Vertical Horizontal
10 10 10 0 0 0
10 10 10 0 0 0 0 0 0 0
1 1 1
10 10 10 0 0 0 30 10 -10 -30
∗ 0 0 0 =
0 0 0 10 10 10 30 10 -10 -30
-1 -1 -1
0 0 0 10 10 10 0 0 0 0
0 0 0 10 10 10
Andrew Ng
Learning to detect edges
1 0 -1
1 0 -1
1 0 -1
3 0 1 2 7 4
1 5 8 9 3 1
#$ #% #&
2 7 2 5 1 3
#' #( #)
0 1 3 1 7 8
#* #+ #,
4 2 1 6 2 8
2 4 5 2 3 9
Andrew Ng
Convolutional
Neural Networks
Padding
deeplearning.ai
Padding
∗ =
Andrew Ng
Valid and Same convolutions
“Valid”:
Andrew Ng
Convolutional
Neural Networks
Strided
deeplearning.ai
convolutions
Strided convolution
[Worked example: a 7×7 input convolved with a 3×3 filter using stride 2 produces a 3×3 output; the filter is moved two positions at a time both horizontally and vertically.]
Andrew Ng
Summary of convolutions
n × n image, f × f filter, padding p, stride s
Output size: ⌊(n + 2p − f)/s + 1⌋ × ⌊(n + 2p − f)/s + 1⌋
Andrew Ng
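As a quick sanity check of the formula (a small helper, not from the slides):
def conv_output_size(n, f, p=0, s=1):
    # floor((n + 2p - f)/s) + 1
    return (n + 2 * p - f) // s + 1

print(conv_output_size(7, 3, p=0, s=2))   # 3, matching the strided example above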
Technical note on cross-correlation vs.
convolution
Convolution in math textbook:
2 3 7 4 6 2
6 6 9 8 7 4
3 4 5
3 4 8 3 8 9
∗ 1 0 2
7 8 3 6 6 3
-1 9 7
4 2 1 8 3 4
3 2 4 1 9 8
Andrew Ng
Convolutional
Neural Networks
Convolutions over
deeplearning.ai
volumes
Convolutions on RGB images
Andrew Ng
Convolutions on RGB image
∗ =
4x4
Andrew Ng
Multiple filters
∗ =
3x3x3 4x4
6x6x3 ∗ =
3x3x3
4x4
Andrew Ng
Convolutional
Neural Networks
One layer of a
deeplearning.ai
convolutional
network
Example of a layer
∗
3x3x3
6x6x3
∗
3x3x3
Andrew Ng
Number of parameters in one layer
Andrew Ng
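A worked example consistent with the convention used in this course: a layer with ten 3×3×3 filters has 3 × 3 × 3 = 27 weights plus 1 bias per filter, i.e. 28 parameters per filter and 10 × 28 = 280 parameters in total, independent of the input image size.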
Summary of notation
If layer l is a convolution layer:
#
" = filter size Input:
$ # = padding Output:
#
% = stride
#
&' = number of filters
Each filter is:
Activations:
Weights:
bias:
Andrew Ng
Convolutional
Neural Networks
A simple convolution
deeplearning.ai
network example
Example ConvNet
Andrew Ng
Types of layer in a convolutional network:
- Convolution
- Pooling
- Fully connected
Andrew Ng
Convolutional
Neural Networks
Pooling layers
deeplearning.ai
Pooling layer: Max pooling
1 3 2 1
2 9 1 1
1 3 2 3
5 6 1 2
Andrew Ng
Pooling layer: Max pooling
1 3 2 1 3
2 9 1 1 5
1 3 2 3 2
8 3 5 1 0
5 6 1 2 9
Andrew Ng
Pooling layer: Average pooling
1 3 2 1
2 9 1 1
1 4 2 3
5 6 1 2
Andrew Ng
Summary of pooling
Hyperparameters:
f : filter size
s : stride
Max or average pooling
Andrew Ng
Convolutional
Neural Networks
Convolutional neural
deeplearning.ai
network example
Neural network example
Andrew Ng
608
3216
48120
10164
850
Convolutional
Neural Networks
Why convolutions?
deeplearning.ai
Why convolutions
Andrew Ng
Why convolutions
10 10 10 0 0 0
10 10 10 0 0 0 0 30 30 0
1 0 -1
10 10 10 0 0 0 0 30 30 0
10 10 10 0 0 0
∗ 1 0 -1 =
0 30 30 0
1 0 -1
10 10 10 0 0 0 0 30 30 0
10 10 10 0 0 0
Cost J = (1/m) Σᵢ₌₁ᵐ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾)
Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Why look at
deeplearning.ai
case studies?
Outline
Classic networks:
• LeNet-5
• AlexNet
• VGG
ResNet
Inception
Andrew Ng
Case Studies
Classic networks
deeplearning.ai
LeNet - 5
avg pool avg pool
⋮
"#
5×5 f=2 5×5 f=2 ⋮
s=1 s=2 s=1 s=2
MAX-POOL
= ⋮ ⋮ ⋮
3×3 3×3 3×3 3×3
s=2
Softmax
same
1000
13×13 ×384 13×13 ×384 13×13 ×256 6×6 ×256 9216 4096 4096
[Krizhevsky et al., 2012. ImageNet classification with deep convolutional neural networks] Andrew Ng
VGG - 16
CONV = 3×3 filter, s = 1, same MAX-POOL = 2×2 , s = 2
224×224 ×3
[Simonyan & Zisserman 2015. Very deep convolutional networks for large-scale image recognition] Andrew Ng
Case Studies
Residual Networks
deeplearning.ai
(ResNets)
Residual block
a[l]  →  a[l+1]  →  a[l+2]
z[l+1] = W[l+1] a[l] + b[l+1],   a[l+1] = g(z[l+1])
z[l+2] = W[l+2] a[l+1] + b[l+2];  with the shortcut connection, a[l+2] = g(z[l+2] + a[l])
[He et al., 2015. Deep residual networks for image recognition] Andrew Ng
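A minimal NumPy sketch of this residual block with ReLU activations, assuming a_l is the incoming activation and the weight matrices are sized so that z[l+2] and a[l] have the same shape (otherwise the shortcut needs an extra projection):
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_l, W1, b1, W2, b2):
    z1 = np.dot(W1, a_l) + b1
    a1 = relu(z1)
    z2 = np.dot(W2, a1) + b2
    return relu(z2 + a_l)   # shortcut: add a[l] before the final non-linearity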
Residual Network
x ![#]
Plain ResNet
training error
training error
# layers # layers
[He et al., 2015. Deep residual networks for image recognition] Andrew Ng
Case Studies
Why ResNets
deeplearning.ai
work
Why do residual networks work?
Andrew Ng
ResNet
Plain
ResNet
[He et al., 2015. Deep residual networks for image recognition] Andrew Ng
Case Studies
Network in Network
deeplearning.ai
and 1×1 convolutions
What does a 1 × 1 convolution do?
1 2 3 6 5 8
3 5 5 1 3 4
2 1 3 4 9 3
4 7 8 5 7 9
∗ 2 =
1 5 3 7 4 8
5 4 9 8 3 5
6×6
∗ =
6 × 6 × 32 1 × 1 × 32 6 × 6 × # filters
[Lin et al., 2013. Network in network] Andrew Ng
Using 1×1 convolutions
ReLU
CONV 1 × 1
32
28 × 28 × 32
28 × 28 × 192
Inception network
deeplearning.ai
motivation
Motivation for inception network
1×1
3×3
64
128
5×5 28
32
32
28
28 × 28 × 192 MAX-POOL
CONV
5 × 5,
same,
32 28 × 28 × 32
28 × 28 × 192
Andrew Ng
Using 1×1 convolution
CONV CONV
1 × 1, 5 × 5,
16, 32,
1 × 1 × 192 28 × 28 × 16 5 × 5 × 16 28 × 28 × 32
28 × 28 × 192
Andrew Ng
Case Studies
Inception network
deeplearning.ai
Inception module
1×1
CONV
1×1 3×3
CONV CONV
Previous Channel
Activation Concat
1×1 5×5
CONV CONV
MAXPOOL
3 × 3,s = 1
1×1
same CONV
Andrew Ng
Inception network
MobileNet
Motivation for MobileNets
• Low computational cost at deployment
• Useful for mobile and embedded vision
applications
• Key idea: Normal vs. depthwise-
separable convolutions
[Howard et al. 2017, MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications] Andrew Ng
Normal Convolution
* =
3x3x3
4 x 4 x 5
6x6x3
Andrew Ng
Depthwise Separable Convolution
Normal Convolution
* =
* * =
Depthwise Pointwise
Andrew Ng
Depthwise Convolution
* =
3x3 4x4x3
6x6x3
Andrew Ng
Depthwise Separable Convolution
Depthwise Convolution
* =
Pointwise Convolution
* =
Andrew Ng
Pointwise Convolution
* =
1x1x3
4 x 4 x 3          4 x 4 x 5
Andrew Ng
Depthwise Separable Convolution
Normal Convolution
* =
* * =
Depthwise Pointwise
Andrew Ng
Cost Summary
Cost of normal convolution
[Howard et al. 2017, MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications] Andrew Ng
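Using the dimensions shown in these slides (6×6×3 input, 3×3 filters, 4×4×5 output), the multiplication counts work out roughly as follows: normal convolution ≈ (3 × 3 × 3) × (4 × 4) × 5 = 2,160; depthwise step ≈ (3 × 3) × (4 × 4) × 3 = 432 plus pointwise step ≈ (1 × 1 × 3) × (4 × 4) × 5 = 240, for a total of 672, i.e. roughly 31% of the cost of the normal convolution.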
Depthwise Separable Convolution
Depthwise Convolution
* =
Pointwise Convolution
* =
Andrew Ng
Convolutional
Neural Networks
MobileNet
Architecture
MobileNet
MobileNet v1
MobileNet v2
Residual Connection
[Sandler et al. 2019, MobileNetV2: Inverted Residuals and Linear Bottlenecks] Andrew Ng
MobileNet v2 Bottleneck
Residual Connection
[Sandler et al. 2019, MobileNetV2: Inverted Residuals and Linear Bottlenecks] Andrew Ng
MobileNet
MobileNet v1
MobileNet v2
Residual Connection
[Sandler et al. 2019, MobileNetV2: Inverted Residuals and Linear Bottlenecks] Andrew Ng
MobileNet v2 Full Architecture
[Sandler et al. 2019, MobileNetV2: Inverted Residuals and Linear Bottlenecks] Andrew Ng
Convolutional
Neural Networks
EfficientNet
EfficientNet
Baseline
𝑦ො
Wider
Higher
Deeper Resolution
Compound scaling
𝑦ො 𝑦ො 𝑦ො 𝑦ො
[Tan and Le, 2019, EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks] Andrew Ng
Practical advice for
using ConvNets
Transfer Learning
deeplearning.ai
Practical advice for
using ConvNets
Data augmentation
deeplearning.ai
Common augmentation method
Mirroring
+20,-20,+20
-20,+20,+20
+5,0,+50
Andrew Ng
Implementing distortions during training
Andrew Ng
Practical advice for
using ConvNets
The state of
deeplearning.ai
computer vision
Data vs. hand-engineering
Andrew Ng
Use open source code
Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Object
deeplearning.ai
localization
What are localization and detection?
Image classification Classification with Detection
localization
Andrew Ng
Classification with localization
⋯ ⋮
1- pedestrian
2- car
3- motorcycle
4- background
Andrew Ng
Defining the target label y
1- pedestrian Need to output #$ , #& , #' , #( , class label (1-4)
2- car
3- motorcycle
4- background
Andrew Ng
Object Detection
Landmark
deeplearning.ai
detection
Landmark detection ConvNet
!" , !$ , !% , !&
Andrew Ng
Object Detection
Object
deeplearning.ai
detection
Car detection example
Training set:
x y
1
Andrew Ng
Sliding windows detection
Andrew Ng
Object Detection
Convolutional
deeplearning.ai implementation of
sliding windows
Turning FC layer into convolutional layers
MAX POOL FC FC
5×5 2×2 ⋮ ⋮
y
14 × 14 × 3 10 × 10 × 16 5 × 5 × 16 400 400 softmax (4)
MAX POOL FC FC
Andrew Ng
Convolution implementation of sliding windows
MAX POOL FC FC FC
MAX POOL FC FC FC
MAX POOL
MAX POOL
Andrew Ng
Object Detection
Bounding box
deeplearning.ai
predictions
Output accurate bounding boxes
Andrew Ng
YOLO algorithm
Labels for training
For each grid cell:
100
100
[Redmon et al., 2015, You Only Look Once: Unified real-time object detection] Andrew Ng
Specify the bounding boxes
0.5
100 0.9
100
[Redmon et al., 2015, You Only Look Once: Unified real-time object detection] Andrew Ng
Object Detection
Intersection
deeplearning.ai
over union
Evaluating object localization
More generally, IoU is a measure of the overlap between two bounding boxes.
Andrew Ng
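A small sketch of the IoU computation for two axis-aligned boxes given as (x1, y1, x2, y2) corners (an illustration of the definition above, not code from the slides):
def iou(box_a, box_b):
    xa1, ya1, xa2, ya2 = box_a
    xb1, yb1, xb2, yb2 = box_b
    # intersection rectangle (zero area if the boxes do not overlap)
    iw = max(0.0, min(xa2, xb2) - max(xa1, xb1))
    ih = max(0.0, min(ya2, yb2) - max(ya1, yb1))
    inter = iw * ih
    union = (xa2 - xa1) * (ya2 - ya1) + (xb2 - xb1) * (yb2 - yb1) - inter
    return inter / union if union > 0 else 0.0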
Object Detection
Non-max
deeplearning.ai
suppression
Non-max suppression example
Andrew Ng
Non-max suppression example
0.6
0.8
0.9
0.3
0.5
Andrew Ng
Non-max suppression example
0.6
0.8
0.9
0.7
0.7
Andrew Ng
Non-max suppression algorithm
Each output prediction is: [p_c, b_x, b_y, b_h, b_w]
Discard all boxes with p_c ≤ 0.6
While there are any remaining boxes:
• Pick the box with the largest p_c;
Output that as a prediction.
19×19
• Discard any remaining box with
IoU ≥ 0.5 with the box output
in the previous step Andrew Ng
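A plain-Python sketch of this procedure, assuming boxes is a list of (p_c, (x1, y1, x2, y2)) pairs and iou() is the helper sketched earlier:
def non_max_suppression(boxes, score_thresh=0.6, iou_thresh=0.5):
    remaining = [b for b in boxes if b[0] > score_thresh]     # discard low-confidence boxes
    remaining.sort(key=lambda b: b[0], reverse=True)          # highest p_c first
    kept = []
    while remaining:
        best = remaining.pop(0)                               # box with the largest p_c
        kept.append(best)
        remaining = [b for b in remaining if iou(best[1], b[1]) < iou_thresh]
    return kept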
Object Detection
Anchor boxes
deeplearning.ai
Overlapping objects:
Anchor box 1: Anchor box 2:
!"
#$
#%
#&
y = #'
()
(*
(+
[Redmon et al., 2015, You Only Look Once: Unified real-time object detection] Andrew Ng
Anchor box algorithm
Previously: With two anchor boxes:
Each object in training Each object in training
image is assigned to grid image is assigned to grid
cell that contains that cell that contains object’s
object’s midpoint. midpoint and anchor box
for the grid cell with
highest IoU.
Andrew Ng
Anchor box example
y = [p_c, b_x, b_y, b_h, b_w, c₁, c₂, c₃]  (anchor box 1)
    [p_c, b_x, b_y, b_h, b_w, c₁, c₂, c₃]  (anchor box 2)
i.e. 16 values per grid cell.
Andrew Ng
Object Detection
Putting it together:
deeplearning.ai
YOLO algorithm
Training
Classes: 1 - pedestrian, 2 - car, 3 - motorcycle
For each grid cell and each of the two anchor boxes, the target vector is [p_c, b_x, b_y, b_h, b_w, c₁, c₂, c₃]. Grid cells with no object have p_c = 0 and the remaining seven values are "don't cares" (?). The grid cell containing an object has p_c = 1 for the best-matching anchor box, together with that object's bounding-box coordinates and class indicator (here c₂ = 1 for the car).
y is 3×3×2×8
[Redmon et al., 2015, You Only Look Once: Unified real-time object detection] Andrew Ng
Making predictions
For each grid cell, the network outputs ŷ as a 3×3×2×8 volume: two [p_c, b_x, b_y, b_h, b_w, c₁, c₂, c₃] vectors per cell, one for each anchor box.
Andrew Ng
Outputting the non-max supressed outputs
Andrew Ng
Object Detection
Region proposals
deeplearning.ai
(Optional)
Region proposal: R-CNN
[Girshik et. al, 2013, Rich feature hierarchies for accurate object detection and semantic segmentation] Andrew Ng
Faster algorithms
[Girshik et. al, 2013. Rich feature hierarchies for accurate object detection and semantic segmentation]
[Girshik, 2015. Fast R-CNN]
[Ren et. al, 2016. Faster R-CNN: Towards real-time object detection with region proposal networks] Andrew Ng
Convolutional
Neural Networks
Semantic segmentation
with U-Net
Object Detection vs. Semantic Segmentation
Andrew Ng
Motivation for U-Net
[Novikov et al., 2017, Fully Convolutional Architectures for Multi-Class Segmentation in Chest Radiographs]
[Dong et al., 2017, Automatic Brain Tumor Detection and Segmentation Using U-Net Based Fully Convolutional Networks ] Andrew Ng
Per-pixel class labels
Every pixel gets a label: 1. Car, 0. Not Car. The target is a segmentation map the same size as the image, with 1s over the car-shaped region and 0s everywhere else.
Andrew Ng
Per-pixel class labels
With multiple classes (1. Car, 2. Building, 3. Road), the segmentation map assigns one class label to every pixel.
Segmentation Map
Andrew Ng
Deep Learning for Semantic Segmentation
ŷ
Andrew Ng
Transpose Convolution
Normal Convolution
* =
Transpose Convolution
* =
Andrew Ng
Transpose Convolution
Example: a 2×2 input, a 3×3 weight filter, padding p = 1, stride s = 2, producing a 4×4 output. Each input value multiplies the filter, the result is placed into the (padded) output with stride 2, overlapping contributions are summed, and the padding is discarded.
Andrew Ng
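A minimal NumPy sketch of that 2×2 → 4×4 example (the input and filter values here are illustrative, not the numbers from the slide):

import numpy as np

x = np.array([[2., 1.],
              [3., 2.]])                      # 2x2 input (illustrative values)
w = np.array([[1., 2., 1.],
              [2., 0., 1.],
              [0., 2., 1.]])                  # 3x3 weight filter (illustrative values)
stride, pad, out = 2, 1, 4
padded = np.zeros((out + 2 * pad, out + 2 * pad))     # 6x6 working area
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        # place each scaled filter copy with stride 2; overlaps add up
        padded[i*stride:i*stride+3, j*stride:j*stride+3] += x[i, j] * w
y = padded[pad:pad+out, pad:pad+out]          # trim the padding -> 4x4 output
print(y.shape)                                # (4, 4)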
U-Net
Conv, RELU
Max Pool
Trans Conv
Skip Connection
Conv (1x1)
[Ronneberger et al., 2015, U-Net: Convolutional Networks for Biomedical Image Segmentation] Andrew Ng
U-Net
h × w × 3    →    h × w × #classes
Conv, RELU
Max Pool
Trans Conv
Skip Connection
Conv (1x1)
[Ronneberger et al., 2015, U-Net: Convolutional Networks for Biomedical Image Segmentation] Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
What is face
deeplearning.ai
recognition?
Face recognition
Recognition
• Has a database of K persons
• Get an input image
• Output ID if the image is any of the K persons (or
“not recognized”)
Andrew Ng
Face recognition
One-shot learning
deeplearning.ai
One-shot learning
Learning from one
example to recognize the
person again
Andrew Ng
Learning a “similarity” function
d(img1,img2) = degree of difference between images
If d(img1, img2) ≤ τ   →   “same person”
If d(img1, img2) > τ   →   “different person”
Andrew Ng
Face recognition
Siamese network
deeplearning.ai
Siamese network
x^(1) → ConvNet → f(x^(1))
x^(2) → ConvNet → f(x^(2))
d(x^(1), x^(2)) = ‖f(x^(1)) − f(x^(2))‖²
[Taigman et. al., 2014. DeepFace closing the gap to human level performance] Andrew Ng
Goal of learning
x^(i) → ConvNet → f(x^(i))
Learn parameters so that ‖f(x^(i)) − f(x^(j))‖² is small if x^(i) and x^(j) are the same person, and large if they are different people.
Andrew Ng
Face recognition
Triplet loss
deeplearning.ai
Learning Objective
[Schroff et al.,2015, FaceNet: A unified embedding for face recognition and clustering] Andrew Ng
Loss function
[Schroff et al.,2015, FaceNet: A unified embedding for face recognition and clustering] Andrew Ng
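The triplet loss discussed here is L(A, P, N) = max(‖f(A) − f(P)‖² − ‖f(A) − f(N)‖² + α, 0); a minimal NumPy sketch (the function name and the value of the margin α are illustrative):

import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    # f_a, f_p, f_n: embedding vectors for anchor, positive, negative
    d_pos = np.sum((f_a - f_p) ** 2)          # squared distance anchor-positive
    d_neg = np.sum((f_a - f_n) ** 2)          # squared distance anchor-negative
    return max(d_pos - d_neg + alpha, 0.0)    # hinge at zero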
Choosing the triplets A,P,N
[Schroff et al.,2015, FaceNet: A unified embedding for face recognition and clustering] Andrew Ng
Training set using triplet loss
Anchor Positive Negative
⋮ ⋮ ⋮
Andrew Ng
Face recognition
[Taigman et. al., 2014. DeepFace closing the gap to human level performance] Andrew Ng
Face verification supervised learning
ŷ = σ( Σ_k w_k |f(x^(i))_k − f(x^(j))_k| + b )
[Taigman et. al., 2014. DeepFace closing the gap to human level performance] Andrew Ng
Neural Style
Transfer
(Network from Zeiler & Fergus: 224×224×3 → 110×110×96 → 55×55×96 → 26×26×256 → 13×13×256 → 13×13×384 → 13×13×384 → 6×6×256 → FC 4096 → FC 4096 → ŷ)
[Zeiler and Fergus., 2013, Visualizing and understanding convolutional networks] Andrew Ng
Visualizing deep layers
Andrew Ng
Visualizing deep layers: Layer 1
Andrew Ng
Visualizing deep layers: Layer 2
Andrew Ng
Visualizing deep layers: Layer 3
Andrew Ng
Visualizing deep layers: Layer 3
Andrew Ng
Visualizing deep layers: Layer 4
Andrew Ng
Visualizing deep layers: Layer 5
Andrew Ng
Neural Style
Transfer
Cost function
deeplearning.ai
Neural style transfer cost function
Content C Style S
Generated image G
[Gatys et al., 2015. A neural algorithm of artistic style. Images on slide generated by Justin Johnson] Andrew Ng
Find the generated image G
1. Initialize G randomly (e.g., G: 100×100×3)
2. Use gradient descent to minimize J(G):  G := G − (∂/∂G) J(G)
Content cost
deeplearning.ai
function
Content cost function
" # = % "'()*+)* ,, # + / "0*12+ (4, #)
• Say you use hidden layer l to compute content cost.
• Use pre-trained ConvNet. (E.g., VGG network)
• Let a^[l](C) and a^[l](G) be the activations of layer l on the images
• If a^[l](C) and a^[l](G) are similar, both images have similar content
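Concretely, the content cost compares the two activation volumes, J_content(C, G) = ½ ‖a^[l](C) − a^[l](G)‖² (the exact normalization constant is not critical). A minimal NumPy sketch, with the function name and shapes as illustrative assumptions:

import numpy as np

def content_cost(a_C, a_G):
    # a_C, a_G: activation volumes of shape (n_H, n_W, n_C) for layer l
    return 0.5 * np.sum((a_C - a_G) ** 2)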
"#
34 83
123 94 44 187
2 30
34 44 187 192
34 44 187 92 124
34 76 232
34 76 232 34
67 232
346776 83 124
194 142
⋮
83 194 94
67 83 194 202
%' %'
%( %(
%& %&
1D and 3D
deeplearning.ai
generalizations of
models
Convolutions in 2D and 1D
∗
2D filter
5×5
2D input image
14×14
1 20 15 3 18 12 4 17 1 3 10 3 1
Andrew Ng
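A minimal NumPy sketch of the 1D case (a "valid" convolution, implemented as cross-correlation as in the course): a 14-sample input and a 5-tap filter give 10 outputs; with 16 such filters the output would be 10×16. The random values here are illustrative only:

import numpy as np

x = np.random.randn(14)          # 1D input, e.g. 14 samples of an EKG signal
w = np.random.randn(5)           # one 5-element 1D filter
y = np.array([x[i:i+5] @ w for i in range(len(x) - len(w) + 1)])
print(y.shape)                   # (10,) i.e. 14 - 5 + 1 outputs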
3D data
Andrew Ng
3D convolution
∗
3D filter
3D volume
Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Why sequence
deeplearning.ai
models?
Examples of sequence data
“The quick brown fox jumped
Speech recognition over the lazy dog.”
Music generation ∅
“There is nothing to like
Sentiment classification in this movie.”
Notation
deeplearning.ai
Motivating example
x: Harry Potter and Hermione Granger invented a new spell.
Andrew Ng
Representing words
x: Harry Potter and Hermione Granger invented a new spell.
! "#$ ! "%$ ! "&$ ⋯ ! "($
Andrew Ng
Representing words
x: Harry Potter and Hermione Granger invented a new spell.
! "#$ ! "%$ ! "&$ ⋯ ! "($
And = 367
Invented = 4700
A=1
New = 5976
Spell = 8376
Harry = 4075
Potter = 6830
Hermione = 4200
Gran… = 4000
Andrew Ng
Recurrent Neural
Networks
Recurrent Neural
deeplearning.ai
Network Model
Why not a standard network?
! "#$ ) "#$
! "%$ ) "%$
⋮ ⋮ ⋮ ⋮
! "'($ ) "'*$
Problems:
- Inputs, outputs can be different lengths in different examples.
- Doesn’t share features learned across different positions of text.
Andrew Ng
Recurrent Neural Networks
Andrew Ng
Simplified RNN notation
+"1$ = 3(566 +"1/#$ + 568 ! "1$ + 96 )
Andrew Ng
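A minimal NumPy sketch of one forward step with this notation (the activation choices g = tanh and softmax, and all variable names, are illustrative assumptions):

import numpy as np

def rnn_step(a_prev, x_t, Waa, Wax, Wya, ba, by):
    # a<t> = tanh(Waa a<t-1> + Wax x<t> + ba)
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    # y_hat<t> = softmax(Wya a<t> + by)
    z = Wya @ a_t + by
    y_hat = np.exp(z - z.max()) / np.sum(np.exp(z - z.max()))
    return a_t, y_hat

# tiny usage example with hypothetical sizes
n_a, n_x, n_y = 5, 3, 4
a, y = rnn_step(np.zeros(n_a), np.random.randn(n_x),
                np.random.randn(n_a, n_a), np.random.randn(n_a, n_x),
                np.random.randn(n_y, n_a), np.zeros(n_a), np.zeros(n_y))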
Recurrent Neural
Networks
Backpropagation
deeplearning.ai
through time
Forward propagation and backpropagation
'( "&$ '( ")$ '( "*$ '( "+. $
Andrew Ng
Forward propagation and backpropagation
Different types
deeplearning.ai
of RNNs
Examples of sequence data
“The quick brown fox jumped
Speech recognition over the lazy dog.”
Music generation ∅
“There is nothing to like
Sentiment classification in this movie.”
Andrew Ng
Examples of RNN architectures
Andrew Ng
Summary of RNN types
One to one · One to many · Many to one · Many to many (T_x = T_y and T_x ≠ T_y variants)
Andrew Ng
Language modelling with an RNN
Training set: large corpus of English text.
Sampling novel
deeplearning.ai
sequences
Sampling a sequence from a trained RNN
'( "&$ '( "/$ '( "0$ '( ")* $
Andrew Ng
Character-level language model
Vocabulary = [a, aaron, …, zulu, <UNK>]
President enrique peña nieto, announced The mortal moon hath her eclipse in love.
sench’s sulk former coming football langston
paring. And subject of this thou art another this fold.
“I was not at all surprised,” said hich langston. When besser be my love to me see sabl’s.
“Concussion epidemic”, to be examined. For whose are ruse of mine eyes heaves.
Andrew Ng
Recurrent Neural
Networks
Vanishing gradients
deeplearning.ai
with RNNs
Vanishing gradients with RNNs
'( "&$ '( "-$ '( "/$ '( ")* $
% ⋮ ⋮ ⋮ ⋮ ⋯ ⋮ ⋮ ⋮ '(
Exploding gradients.
Andrew Ng
Recurrent Neural
Networks
Gated Recurrent
deeplearning.ai
Unit (GRU)
RNN unit
Andrew Ng
GRU (simplified)
Andrew Ng
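The simplified GRU taught here keeps a memory cell c^<t> with an update gate Γ_u (the full GRU adds a relevance gate Γ_r). A minimal NumPy sketch under that assumption, with illustrative names:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step_simplified(c_prev, x_t, Wc, Wu, bc, bu):
    concat = np.concatenate([c_prev, x_t])
    c_tilde = np.tanh(Wc @ concat + bc)          # candidate value c~<t>
    gamma_u = sigmoid(Wu @ concat + bu)          # update gate Gamma_u
    # c<t> = Gamma_u * c~<t> + (1 - Gamma_u) * c<t-1>
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev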
Recurrent Neural
Networks
GRU:  a^<t> = c^<t>
LSTM: a^<t> = Γ_o ∗ c^<t>
[Hochreiter & Schmidhuber 1997. Long short-term memory] Andrew Ng
LSTM in pictures
c̃^<t> = tanh(W_c [a^<t−1>, x^<t>] + b_c)
Γ_u = σ(W_u [a^<t−1>, x^<t>] + b_u)   (update gate)
Γ_f = σ(W_f [a^<t−1>, x^<t>] + b_f)   (forget gate)
Γ_o = σ(W_o [a^<t−1>, x^<t>] + b_o)   (output gate)
c^<t> = Γ_u ∗ c̃^<t> + Γ_f ∗ c^<t−1>
a^<t> = Γ_o ∗ c^<t>
(Figure: LSTM units chained across time steps, each emitting ŷ^<t> through a softmax.)
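A minimal NumPy sketch of one LSTM step implementing the equations above (names are illustrative; note that many formulations use a^<t> = Γ_o ∗ tanh(c^<t>) instead of Γ_o ∗ c^<t>):

import numpy as np
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, Wf, Wu, Wc, Wo, bf, bu, bc, bo):
    concat = np.concatenate([a_prev, x_t])
    gamma_f = sigmoid(Wf @ concat + bf)      # forget gate
    gamma_u = sigmoid(Wu @ concat + bu)      # update gate
    gamma_o = sigmoid(Wo @ concat + bo)      # output gate
    c_tilde = np.tanh(Wc @ concat + bc)      # candidate cell value
    c_t = gamma_u * c_tilde + gamma_f * c_prev
    a_t = gamma_o * c_t                      # as on the slide above
    return a_t, c_t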
Bidirectional RNN
deeplearning.ai
Getting information from the future
He said, “Teddy bears are on sale!”
He said, “Teddy Roosevelt was a great President!”
!" #)% !" #(% !" #*% !" #.% !" #-% !" #/% !" #$%
' #)% ' #(% ' #*% ' #.% ' #-% ' #/% ' #$%
He said, “Teddy bears are on sale!”
Andrew Ng
Bidirectional RNN (BRNN)
Andrew Ng
Recurrent Neural
Networks
Deep RNNs
deeplearning.ai
Deep RNN example
, "#$ , "%$ , "&$ , "'$
([#]"+$
Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Word representation
deeplearning.ai
Word representation
V = [a, aaron, …, zulu, <UNK>]
1-hot representation
Andrew Ng
Featurized representation: word embedding
Man Woman King Queen Apple Orange
(5391) (9853) (4914) (7157) (456) (6257)
man
woman
dog
king
cat
queen fish
apple
grape
three four
one orange
two
[van der Maaten and Hinton., 2008. Visualizing data using t-SNE] Andrew Ng
NLP and Word
Embeddings
Using word
deeplearning.ai
embeddings
Named entity recognition example
1 1 0 0 0 0
Andrew Ng
Transfer learning and word embeddings
Andrew Ng
Relation to face encoding
[Taigman et. al., 2014. DeepFace: Closing the gap to human level performance] Andrew Ng
NLP and Word
Embeddings
Properties of word
deeplearning.ai
embeddings
Analogies
Man Woman King Queen Apple Orange
(5391) (9853) (4914) (7157) (456) (6257)
Gender −1 1 -0.95 0.97 0.00 0.01
Royal 0.01 0.02 0.93 0.95 -0.01 0.00
Age 0.03 0.02 0.70 0.69 0.03 -0.02
Food 0.09 0.01 0.02 0.01 0.95 0.97
[Mikolov et. al., 2013, Linguistic regularities in continuous space word representations] Andrew Ng
Analogies using word vectors man
woman dog
king
cat
queen fish
Andrew Ng
Cosine similarity
sim(e_w, e_king − e_man + e_woman)
Man:Woman as Boy:Girl
Ottawa:Canada as Nairobi:Kenya
Big:Bigger as Tall:Taller
Yen:Japan as Ruble:Russia
Andrew Ng
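Cosine similarity is sim(u, v) = uᵀv / (‖u‖₂‖v‖₂). A minimal sketch of the analogy search above, using tiny hypothetical 2-dimensional embeddings (values loosely inspired by the gender/royal features on the earlier slide, purely illustrative):

import numpy as np

def cosine_similarity(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

emb = {"man":   np.array([ 1.00, 0.00]),
       "woman": np.array([-1.00, 0.00]),
       "king":  np.array([ 0.95, 0.93]),
       "queen": np.array([-0.95, 0.92]),
       "apple": np.array([ 0.00, -0.95])}
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in ("king", "man", "woman")),
           key=lambda w: cosine_similarity(emb[w], target))
print(best)   # queen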
NLP and Word
Embeddings
Embedding matrix
deeplearning.ai
Embedding matrix
Learning word
deeplearning.ai
embeddings
Neural language model
I want a glass of orange ______.
4343 9665 1 3852 6163 6257
I:   o_4343 → E → e_4343
a:   o_1 → E → e_1
of:  o_6163 → E → e_6163
Last 1 word
Nearby 1 word
Andrew Ng
NLP and Word
Embeddings
Word2Vec
deeplearning.ai
Skip-grams
I want a glass of orange juice to go along with my cereal.
[Mikolov et. al., 2013. Efficient estimation of word representations in vector space.] Andrew Ng
Model
Vocab size = 10,000
Andrew Ng
Problems with softmax classification
p(t | c) = e^{θ_t^T e_c} / Σ_{j=1}^{10,000} e^{θ_j^T e_c}
Andrew Ng
NLP and Word
Embeddings
Negative sampling
deeplearning.ai
Defining a new learning problem
I want a glass of orange juice to go along with my cereal.
[Mikolov et. al., 2013. Distributed representation of words and phrases and their compositionality] Andrew Ng
Model
Softmax: p(t | c) = e^{θ_t^T e_c} / Σ_{j=1}^{10,000} e^{θ_j^T e_c}
context   word   target?
orange juice 1
orange king 0
orange book 0
orange the 0
orange of 0
Andrew Ng
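Negative sampling replaces the 10,000-way softmax with k + 1 binary classifiers, P(y = 1 | c, t) = σ(θ_tᵀ e_c). A minimal NumPy sketch of the resulting logistic loss over one positive and k negative targets (function name and array layout are illustrative assumptions):

import numpy as np
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(theta_targets, e_c, labels):
    # theta_targets: (k+1, d) parameter vectors for the true target + k sampled negatives
    # e_c: (d,) context-word embedding; labels: (k+1,) with 1 for the true target, else 0
    p = sigmoid(theta_targets @ e_c)
    return -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))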
Selecting negative examples
context word target?
orange juice 1
orange king 0
orange book 0
orange the 0
orange of 0
Andrew Ng
NLP and Word
Embeddings
[Pennington et. al., 2014. GloVe: Global vectors for word representation] Andrew Ng
Model
Andrew Ng
A note on the featurization view of word
embeddings
Man Woman King Queen
(5391) (9853) (4914) (7157)
Gender −1 1 -0.95 0.97
Royal 0.01 0.02 0.93 0.95
Age 0.03 0.02 0.70 0.69
Food 0.09 0.01 0.02 0.01
minimize Σ_{i=1}^{10,000} Σ_{j=1}^{10,000} f(X_ij) (θ_i^T e_j + b_i + b'_j − log X_ij)²
Andrew Ng
NLP and Word
Embeddings
Sentiment
deeplearning.ai
classification
Sentiment classification problem
! "
The dessert is excellent.
Andrew Ng
Simple sentiment classification model
The dessert is excellent
8928 2468 4694 3180
Each word index is mapped through the embedding matrix E to its embedding (e_8928, e_2468, e_4694, e_3180); the embeddings are averaged and fed to a softmax that outputs ŷ.
Andrew Ng
NLP and Word
Embeddings
Debiasing word
deeplearning.ai
embeddings
The problem of bias in word embeddings
Man:Woman as King:Queen
[Bolukbasi et. al., 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings] Andrew Ng
Addressing bias in word embeddings
1. Identify bias direction.
2. Neutralize: for every word that is not definitional, project to get rid of bias.
3. Equalize pairs.
[Bolukbasi et. al., 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings] Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Basic models
deeplearning.ai
Sequence to sequence model
% "&$ % "*$ % "+$ % ",$ % "-$
Jane visite l’Afrique en septembre
Jane is visiting Africa in September.
. "&$ . "*$ . "+$ . ",$ . "-$ . "/$
!"#$ ⋯
% "&$ % "'( $
Image captioning: an AlexNet-style ConvNet (224×224×3 → 55×55×96 → 27×27×256 → 13×13×384 → 13×13×256 → 6×6×256 → FC 4096 → FC 4096, with the final 1000-way softmax removed) encodes the image; the 4096-dimensional encoding is fed to an RNN decoder that generates the caption.
[Mao et. al., 2014. Deep captioning with multimodal recurrent neural networks]
[Vinyals et. al., 2014. Show and tell: Neural image caption generator]
[Karpathy and Li, 2015. Deep visual-semantic alignments for generating image descriptions] Andrew Ng
Sequence to
sequence models
% "&$ % ".$
% "&$ % "*+ $
Andrew Ng
Finding the most likely translation
Jane visite l’Afrique en septembre.    P(y^<1>, …, y^<T_y> | x)
Andrew Ng
Why not a greedy search?
'( "&$ '( "*, $
!"#$ ⋯ ⋯
% "&$ % "*+ $
Andrew Ng
Sequence to
sequence models
Beam search
deeplearning.ai
Beam search algorithm
Step 1
Vocabulary (10,000 words): a, …, in, …, jane, …, september, …, zulu
Evaluate P(y^<1> | x) with the encoder-decoder network and keep the B most likely choices for the first word.
Andrew Ng
Beam search algorithm
Step 2
For each of the B choices of the first word (e.g., “in”, “jane”, “september”), run the decoder one more step to evaluate P(y^<2> | x, y^<1>) over the whole vocabulary, and keep the B most likely (first word, second word) pairs (e.g., “jane is”, “jane visits”).
Refinements to
deeplearning.ai
beam search
Length normalization
arg max_y Σ_{t=1}^{T_y} log P(y^<t> | x, y^<1>, …, y^<t−1>), normalized as (1/T_y^α) Σ_{t=1}^{T_y} log P(y^<t> | x, y^<1>, …, y^<t−1>)
Beam width B?
Andrew Ng
Sequence to
sequence models
Error analysis on
deeplearning.ai
beam search
Example
Jane visite l’Afrique en septembre.
!"#$ ⋯
% "&$ % "'( $
Andrew Ng
Error analysis on beam search
Human: Jane visits Africa in September. (y*)
Algorithm: Jane visited Africa last September. (ŷ)
Case 1:
Beam search chose ŷ. But y* attains higher P(y | x).
Conclusion: Beam search is at fault.
Case 2:
y* is a better translation than ŷ. But the RNN predicted P(y* | x) < P(ŷ | x).
Conclusion: RNN model is at fault.
Andrew Ng
Error analysis process
Human   Algorithm   P(y* | x)   P(ŷ | x)   At fault?
Bleu score
deeplearning.ai
(optional)
Evaluating machine translation
French: Le chat est sur le tapis.
[Papineni et. al., 2002. Bleu: A method for automatic evaluation of machine translation] Andrew Ng
Bleu score on bigrams
Example: Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
MT output: The cat the cat on the mat.
the cat
cat the
cat on
on the
the mat
[Papineni et. al., 2002. Bleu: A method for automatic evaluation of machine translation] Andrew Ng
Bleu score on unigrams
Example: Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
MT output: The cat the cat on the mat.
[Papineni et. al., 2002. Bleu: A method for automatic evaluation of machine translation] Andrew Ng
Bleu details
p_n = Bleu score on n-grams only
Combined Bleu score:
[Papineni et. al., 2002. Bleu: A method for automatic evaluation of machine translation] Andrew Ng
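A minimal sketch of the modified n-gram precision p_n used above (the combined Bleu score then multiplies a brevity penalty by exp of the average of log p_1 … p_4). Function name and tokenization are illustrative assumptions:

from collections import Counter

def modified_precision(candidate, references, n):
    # clip each candidate n-gram count by its maximum count in any reference
    ngrams = lambda words: Counter(tuple(words[i:i+n]) for i in range(len(words) - n + 1))
    cand = ngrams(candidate.split())
    refs = [ngrams(r.split()) for r in references]
    clipped = sum(min(count, max(r[g] for r in refs)) for g, count in cand.items())
    return clipped / max(sum(cand.values()), 1)

refs = ["the cat is on the mat", "there is a cat on the mat"]
print(modified_precision("the cat the cat on the mat", refs, 2))   # 4/6, as on the bigram slide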
Sequence to
sequence models
Attention model
deeplearning.ai
intuition
The problem of long sequences
(Encoder-decoder: the encoder must read and memorize the entire input x^<1> … x^<T_x> before the decoder produces ŷ^<1> … ŷ^<T_y>.)
Jane s'est rendue en Afrique en septembre dernier, a apprécié la culture et a rencontré beaucoup de
gens merveilleux; elle est revenue en parlant comment son voyage était merveilleux, et elle me tente
d'y aller aussi.
Jane went to Africa last September, and enjoyed the culture and met many wonderful people;
she came back raving about how wonderful her trip was, and is tempting me to go too.
(Plot: Bleu score versus sentence length; the basic encoder-decoder's score degrades on long sentences, which attention addresses.)
Attention model
deeplearning.ai
Attention model
)"*$
[Bahdanau et. al., 2014. Neural machine translation by jointly learning to align and translate]
[Xu et. al., 2015. Show, attend and tell: Neural image caption generation with visual attention] Andrew Ng
Attention examples
July 20th 1969 1969 − 07 − 20
",,, .$
Visualization of + :
Andrew Ng
Audio data
Speech recognition
deeplearning.ai
Speech recognition problem
x (audio clip)  →  y (transcript)
Andrew Ng
Attention model for speech recognition
Output characters “T”, “h”, … : ŷ^<1> ŷ^<2> ⋯
x^<1> ⋯ (audio input features)
Andrew Ng
CTC cost for speech recognition
(Connectionist temporal classification)
“the quick brown fox”
ŷ^<1> ŷ^<2> … ŷ^<1000>   (e.g., “ttt_h_eee___␣___qqq__…”)
Basic rule: collapse repeated characters not separated by “blank” (_).
x^<1> ⋯ (e.g., 1,000 input time steps of audio)
Trigger word
deeplearning.ai
detection
What is trigger word detection?
Andrew Ng
Trigger word detection algorithm
!"#$
Andrew Ng
Conclusion
Summary and
deeplearning.ai
thank you
Specialization outline
Andrew Ng
Deep learning is a super power
Andrew Ng
Thank you.
-Andrew Ng
Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Transformers
deeplearning.ai
Intuition
Transformers Motivation
From RNN to GRU to LSTM, the units grow increasingly complex, and computation remains sequential: each unit must wait for the previous time step.
Andrew Ng
Transformers Intuition
• Attention + CNN
• Self-Attention
• Multi-Head Attention
Self-Attention
deeplearning.ai
Self-Attention Intuition
A(q, K, V) = attention-based vector representation of a word
RNN attention (for comparison):  α^<t,t'> = exp(e^<t,t'>) / Σ_{t'=1}^{T_x} exp(e^<t,t'>)
Self-attention:  A(q, K, V) = Σ_i [ exp(q · k^<i>) / Σ_j exp(q · k^<j>) ] v^<i>
Query (Q), Key (K), Value (V): every word t gets its own q^<t>, k^<t>, v^<t>.
Example for word 3: compute q^<3> · k^<1>, q^<3> · k^<2>, …, q^<3> · k^<5>, softmax them, and take the weighted sum of v^<1>, …, v^<5> to get A^<3>.
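A minimal NumPy sketch of self-attention for all words at once, including the 1/√d_k scaling introduced on the next slide (matrix layout and names are illustrative assumptions):

import numpy as np

def self_attention(Q, K, V):
    # Q, K, V: (T, d) matrices of per-word queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (T, T) word-to-word scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over each row
    return weights @ V                                   # (T, d) attention outputs

T, d = 5, 8                                              # e.g., 5 words, d_k = 8
Q, K, V = (np.random.randn(T, d) for _ in range(3))
A = self_attention(Q, K, V)                              # A[2] plays the role of A^<3>
print(A.shape)                                           # (5, 8)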
Multi-Head
deeplearning.ai
Attention
Multi-Head Attention
MultiHead(Q, K, V) = concat(head_1, head_2, …, head_n) W_o
head_i = Attention(W_i^Q Q, W_i^K K, W_i^V V)
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
Each head applies its own learned projections W_i^Q, W_i^K, W_i^V to every word's (q^<t>, k^<t>, v^<t>), asking a different question, for example:
head 1 (W_1^Q, W_1^K, W_1^V): “Did what?”
head 2 (W_2^Q, W_2^K, W_2^V): “When?”
head 3 (W_3^Q, W_3^K, W_3^V): “Who?”
Transformers
deeplearning.ai
Transformer Details
Input (<SOS> x^<1> x^<2> … x^<T_x> <EOS>): Jane visite l’Afrique en septembre
Encoder: Multi-Head Attention over the input embeddings plus positional encodings.
Decoder: Masked Multi-Head Attention over the outputs generated so far (<SOS> y^<1> y^<2> … y^<T_y>), then Multi-Head Attention over the encoder output, then Linear + Softmax to predict the next word.
Output: <SOS> Jane visits Africa in September <EOS>
Positional Encoding:
PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
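A minimal NumPy sketch of that positional encoding, producing a (positions × d) matrix to add to the word embeddings (function name and dimensions are illustrative assumptions):

import numpy as np

def positional_encoding(max_pos, d):
    pos = np.arange(max_pos)[:, None]            # (max_pos, 1) positions
    i = np.arange(0, d, 2)[None, :]              # even embedding dimensions
    angles = pos / np.power(10000.0, i / d)      # pos / 10000^(2i/d)
    pe = np.zeros((max_pos, d))
    pe[:, 0::2] = np.sin(angles)                 # even dims: sin
    pe[:, 1::2] = np.cos(angles)                 # odd dims: cos
    return pe

print(positional_encoding(50, 16).shape)         # (50, 16)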