DL Notes

The document provides an overview of a Deep Learning course offered by DeepLearning.AI, emphasizing the significance of AI in transforming various industries. It outlines the course structure, including topics such as neural networks, supervised learning, and logistic regression, while also detailing the educational use of the slides under a Creative Commons License. Additionally, it discusses the mathematical notations and representations used in deep learning, particularly in binary classification problems.


Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Introduction to
Deep Learning

Welcome
deeplearning.ai

Andrew Ng
• AI is the new Electricity
• Electricity had once transformed
countless industries: transportation,
manufacturing, healthcare,
communications, and more

• AI will now bring about an equally
big transformation.

Andrew Ng
What you’ll learn
Courses in this sequence (Specialization):
1. Neural Networks and Deep Learning
2. Improving Deep Neural Networks: Hyperparameter
tuning, Regularization and Optimization
3. Structuring your Machine Learning project
4. Convolutional Neural Networks
5. Natural Language Processing: Building sequence models

Andrew Ng
Introduction to
Deep Learning

What is a
deeplearning.ai
Neural Network?
Housing Price Prediction
price

size of house
Housing Price Prediction

Inputs: size (x1), #bedrooms (x2), zip code (x3), wealth (x4)  →  neural network  →  price (y)
Introduction to
Deep Learning
Supervised Learning
deeplearning.ai with Neural Networks
Supervised Learning
Input(x) Output (y) Application
Home features Price Real Estate
Ad, user info Click on ad? (0/1) Online Advertising

Image Object (1,…,1000) Photo tagging

Audio Text transcript Speech recognition

English Chinese Machine translation

Image, Radar info Position of other cars Autonomous driving


Neural Network examples

Standard NN Convolutional NN Recurrent NN


Supervised Learning
Structured Data                              Unstructured Data

Housing data:                                Audio (e.g. a speech clip)
Size   #bedrooms   …   Price (1000$s)        Image (e.g. a cat photo)
2104   3               400                   Text (e.g. “Four score and seven years ago…”)
1600   3               330
2400   3               369
…      …               …
3000   4               540

Online advertising data:
User Age   Ad Id   …   Click
41         93242       1
80         93287       0
18         87312       1
…          …           …
27         71244       1
Introduction to
Neural Networks

Why is Deep
deeplearning.ai Learning taking off?
Andrew Ng
Scale drives deep learning progress
Performance (y-axis) vs. amount of data (x-axis)
Andrew Ng
Scale drives deep learning progress

• Data

• Computation

• Algorithms

Iteration cycle: Idea → Code → Experiment
Andrew Ng
Introduction to
Neural Networks

About this Course


deeplearning.ai

Andrew Ng
Courses in this Specialization
1. Neural Networks and Deep Learning
2. Improving Deep Neural Networks: Hyperparameter
tuning, Regularization and Optimization
3. Structuring your Machine Learning project
4. Convolutional Neural Networks
5. Natural Language Processing: Building sequence models

Andrew Ng
Outline of this Course
Week 1: Introduction

Week 2: Basics of Neural Network programming

Week 3: One hidden layer Neural Networks

Week 4: Deep Neural Networks

Andrew Ng


Basics of Neural
Network Programming

Binary Classification
deeplearning.ai
Binary Classification

1 (cat) vs 0 (non cat)

(Figure: the input image is stored as three matrices of pixel intensity values, one each for the Red, Green and Blue channels; these values are unrolled into the feature vector x.)
Andrew Ng
Notation

Andrew Ng
Basics of Neural
Network Programming

Logistic Regression
deeplearning.ai
Logistic Regression

Andrew Ng
Basics of Neural
Network Programming

Logistic Regression
deeplearning.ai
cost function
Logistic Regression cost function
ŷ = σ(wᵀx + b), where σ(z) = 1 / (1 + e^(−z))

Given {(x^(1), y^(1)), …, (x^(m), y^(m))}, want ŷ^(i) ≈ y^(i).
Loss (error) function:

Andrew Ng
Basics of Neural
Network Programming

Gradient Descent
deeplearning.ai
Gradient Descent
Recap: ŷ = σ(wᵀx + b), σ(z) = 1 / (1 + e^(−z))

J(w, b) = (1/m) Σ_(i=1)^m L(ŷ^(i), y^(i)) = −(1/m) Σ_(i=1)^m [ y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i)) ]

Want to find w, b that minimize J(w, b).

(Figure: the surface of J(w, b) plotted over w and b.)

Andrew Ng
Gradient Descent

Andrew Ng
Basics of Neural
Network Programming

Derivatives
deeplearning.ai
Intuition about derivatives
! " = 3"

"
Andrew Ng
Basics of Neural
Network Programming

More derivatives
deeplearning.ai
examples
Intuition about derivatives
! " = "$

"
Andrew Ng
More derivative examples

Andrew Ng
Basics of Neural
Network Programming

Computation Graph
deeplearning.ai
Computation Graph

Andrew Ng
Basics of Neural
Network Programming

Derivatives with a
deeplearning.ai Computation Graph
Computing derivatives
&=5
11 33
"=3 6 $ =&+! ) = 3$
!="#
#=2

Andrew Ng
Computing derivatives
&=5
11 33
"=3 6 $ =&+! ) = 3$
!="#
#=2

Andrew Ng
Basics of Neural
Network Programming

Logistic Regression
deeplearning.ai
Gradient descent
Logistic regression recap

! = $%& + (
)* = + = ,(!)
ℒ +, ) = −() log(+) + (1 − )) log(1 − +))

Andrew Ng
Logistic regression derivatives
&%
$%
&( ! = $% &% + $( &( + ) * = +(!) ℒ(a, 1)
$(
b

Andrew Ng
Basics of Neural
Network Programming

Gradient descent
deeplearning.ai
on m examples
Logistic regression on m examples

Andrew Ng
Logistic regression on m examples

Andrew Ng
Basics of Neural
Network Programming

Vectorization
deeplearning.ai
What is vectorization?

Andrew Ng
Basics of Neural
Network Programming

More vectorization
deeplearning.ai
examples
Neural network programming guideline
Whenever possible, avoid explicit for-loops.

Andrew Ng
Vectors and matrix valued functions
Say you need to apply the exponential operation on every element of a
matrix/vector.

v = [v1, v2, … , vn]ᵀ

    import math
    u = np.zeros((n, 1))
    for i in range(n):
        u[i] = math.exp(v[i])   # explicit for-loop over the elements

Andrew Ng
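The same result with no explicit loop, using NumPy's element-wise exponential (a minimal sketch; v is assumed to be a NumPy array):

    import numpy as np
    u = np.exp(v)   # applies exp to every entry of v at once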
Logistic regression derivatives
J = 0, dw1 = 0, dw2 = 0, db = 0
for i = 1 to m:
    z(i) = wᵀx(i) + b
    a(i) = σ(z(i))
    J += −[ y(i) log a(i) + (1 − y(i)) log(1 − a(i)) ]
    dz(i) = a(i) − y(i)
    dw1 += x1(i) dz(i)
    dw2 += x2(i) dz(i)
    db  += dz(i)
J = J/m, dw1 = dw1/m, dw2 = dw2/m, db = db/m

Andrew Ng
Basics of Neural
Network Programming

Vectorizing Logistic
deeplearning.ai
Regression
Vectorizing Logistic Regression
! (#) = & ' ( (#) + * ! (-) = & ' ( (-) + * ! (.) = & ' ( (.) + *
+(#) = ,(! (#) ) +(-) = ,(! (-) ) +(.) = ,(! (.) )

Andrew Ng
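Stacking the examples as columns of X, the forward pass is computed for all m examples at once; a minimal NumPy sketch (X is the (n_x, m) data matrix, w the (n_x, 1) weight vector):

    import numpy as np
    Z = np.dot(w.T, X) + b        # (1, m) row of z(i); the scalar b is broadcast
    A = 1 / (1 + np.exp(-Z))      # (1, m) row of predictions a(i)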
Basics of Neural
Network Programming

Vectorizing Logistic
deeplearning.ai Regression’s Gradient
Computation
Vectorizing Logistic Regression

Andrew Ng
Implementing Logistic Regression

J = 0, dw1 = 0, dw2 = 0, db = 0
for i = 1 to m:
    z(i) = wᵀx(i) + b
    a(i) = σ(z(i))
    J += −[ y(i) log a(i) + (1 − y(i)) log(1 − a(i)) ]
    dz(i) = a(i) − y(i)
    dw1 += x1(i) dz(i)
    dw2 += x2(i) dz(i)
    db  += dz(i)
J = J/m, dw1 = dw1/m, dw2 = dw2/m
db = db/m
Andrew Ng
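A fully vectorized version of one gradient-descent step, as a sketch under the same assumptions (X is (n_x, m), Y is (1, m), alpha is the learning rate):

    Z = np.dot(w.T, X) + b
    A = 1 / (1 + np.exp(-Z))
    dZ = A - Y                      # (1, m)
    dw = np.dot(X, dZ.T) / m        # (n_x, 1)
    db = np.sum(dZ) / m
    w = w - alpha * dw
    b = b - alpha * db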
Basics of Neural
Network Programming

Broadcasting in
deeplearning.ai
Python
Broadcasting example
Calories from Carbs, Proteins, Fats in 100g of different foods:
Apples Beef Eggs Potatoes
Carb 56.0 0.0 4.4 68.0
Protein 1.2 104.0 52.0 8.0
Fat 1.8 135.0 99.0 0.9

cal = A.sum(axis = 0)
percentage = 100*A/(cal.reshape(1,4))
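A self-contained version of this example (the matrix A holds the table above; a minimal sketch):

    import numpy as np
    A = np.array([[56.0,   0.0,  4.4, 68.0],
                  [ 1.2, 104.0, 52.0,  8.0],
                  [ 1.8, 135.0, 99.0,  0.9]])
    cal = A.sum(axis=0)                       # total calories per food, shape (4,)
    percentage = 100 * A / cal.reshape(1, 4)  # (1,4) row is broadcast over the (3,4) matrix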
Broadcasting example
[1, 2, 3, 4]ᵀ + 100 = [101, 102, 103, 104]ᵀ
(the scalar 100 is broadcast to every element)

[[1, 2, 3],     [[100, 200, 300],     [[101, 202, 303],
 [4, 5, 6]]  +   [100, 200, 300]]  =   [104, 205, 306]]
(the (1, 3) row vector is copied down the rows)

[[1, 2, 3],     [[100, 100, 100],     [[101, 102, 103],
 [4, 5, 6]]  +   [200, 200, 200]]  =   [204, 205, 206]]
(the (2, 1) column vector is copied across the columns)

General Principle: when an (m, n) matrix is combined with a (1, n) or (m, 1) array, the smaller array is expanded (broadcast) to (m, n) before the element-wise operation.
Basics of Neural
Network Programming

Explanation of logistic
deeplearning.ai regression cost function
(Optional)
Logistic regression cost function

Andrew Ng
Logistic regression cost function
If y = 1:  p(y|x) = ŷ
If y = 0:  p(y|x) = 1 − ŷ

Andrew Ng
Cost on m examples

Andrew Ng
Standard notations for Deep Learning

This document has the purpose of discussing a new standard for deep learning mathematical notations.

1 Neural Networks Notations

General comments:
· superscript (i) will denote the ith training example, while superscript [l] will denote the lth layer

Sizes:
· m : number of examples in the dataset
· n_x : input size
· n_y : output size (or number of classes)
· n_h^[l] : number of hidden units of the lth layer. In a for loop, it is possible to denote n_x = n_h^[0] and n_y = n_h^[number of layers + 1].
· L : number of layers in the network

Objects:
· X ∈ R^(n_x × m) is the input matrix
· x^(i) ∈ R^(n_x) is the ith example, represented as a column vector
· Y ∈ R^(n_y × m) is the label matrix
· y^(i) ∈ R^(n_y) is the output label for the ith example
· W^[l] ∈ R^(number of units in next layer × number of units in the previous layer) is the weight matrix; superscript [l] indicates the layer
· b^[l] ∈ R^(number of units in next layer) is the bias vector in the lth layer
· ŷ ∈ R^(n_y) is the predicted output vector. It can also be denoted a^[L], where L is the number of layers in the network.

Common forward propagation equation examples:
· a = g^[l](W_x x^(i) + b_1) = g^[l](z_1), where g^[l] denotes the lth layer activation function
· ŷ^(i) = softmax(W_h h + b_2)
· General Activation Formula: a_j^[l] = g^[l]( Σ_k w_jk^[l] a_k^[l−1] + b_j^[l] ) = g^[l](z_j^[l])
· J(x, W, b, y) or J(ŷ, y) denotes the cost function.

Examples of cost function:
· J_CE(ŷ, y) = − Σ_(i=0)^m y^(i) log ŷ^(i)
· J_1(ŷ, y) = Σ_(i=0)^m | y^(i) − ŷ^(i) |

2 Deep Learning representations

For representations:
· nodes represent inputs, activations or outputs
· edges represent weights or biases

Here are several examples of standard deep learning representations.

Figure 1: Comprehensive Network: representation commonly used for Neural Networks. For better aesthetics, we omitted the details on the parameters (w_ij^[l], b_i^[l], etc.) that should appear on the edges.
Figure 2: Simplified Network: a simpler representation of a two-layer neural network; both are equivalent.
Binary Classification

In a binary classification problem, the result is a discrete value output.

For example:
- account hacked (1) or not hacked (0)
- a tumor malignant (1) or benign (0)

Example: Cat vs Non-Cat


The goal is to train a classifier for which the input is an image represented by a feature vector, x, and which predicts whether the corresponding label y is 1 or 0. In this case, whether this is a cat image (1) or a non-cat image (0).

An image is stored in the computer in three separate matrices corresponding to the Red, Green, and Blue color channels of the image. The three matrices have the same size as the image; for example, if the resolution of the cat image is 64 pixels x 64 pixels, the three matrices (RGB) are 64 x 64 each.

The value in a cell represents the pixel intensity, which will be used to create a feature vector of n dimensions. In pattern recognition and machine learning, a feature vector represents an image; the classifier's job is then to determine whether it contains a picture of a cat or not.

To create the feature vector x, the pixel intensity values are "unrolled" or "reshaped" for each color channel (red, green, blue). The dimension of the input feature vector x is n_x = 64 x 64 x 3 = 12288.
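A minimal sketch of this unrolling in NumPy (image is an illustrative name for a (64, 64, 3) array of RGB pixel values):

    import numpy as np
    x = image.reshape(64 * 64 * 3, 1)   # column feature vector, shape (12288, 1)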
Logistic Regression

Logistic regression is a learning algorithm used in a supervised learning problem when the output labels y are all either zero or one. The goal of logistic regression is to minimize the error between its predictions and the training data.

Example: Cat vs No - cat


Given an image represented by a feature vector 𝑥, the algorithm will evaluate the probability of a cat
being in that image.

𝐺𝑖𝑣𝑒𝑛 𝑥 , 𝑦̂ = 𝑃(𝑦 = 1|𝑥), where 0 ≤ 𝑦̂ ≤ 1


The parameters used in logistic regression are:

• The input features vector: x ∈ R^(n_x), where n_x is the number of features
• The training label: y ∈ {0, 1}
• The weights: w ∈ R^(n_x), where n_x is the number of features
• The threshold (bias): b ∈ R
• The output: ŷ = σ(wᵀx + b)
• Sigmoid function: s = σ(wᵀx + b) = σ(z) = 1 / (1 + e^(−z))

(wᵀx + b) is a linear function (ax + b), but since we are looking for a probability constrained to [0, 1], the sigmoid function is used. The sigmoid is bounded between 0 and 1.

Some observations about the sigmoid (see the sketch after this list):

• If z is a large positive number, then σ(z) ≈ 1
• If z is a large negative number, then σ(z) ≈ 0
• If z = 0, then σ(z) = 0.5
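A minimal NumPy sketch of the sigmoid and these limiting cases (function name is illustrative):

    import numpy as np
    def sigmoid(z):
        return 1 / (1 + np.exp(-z))
    print(sigmoid(100))    # ~1.0  (large positive z)
    print(sigmoid(-100))   # ~0.0  (large negative z)
    print(sigmoid(0))      # 0.5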
Logistic Regression: Cost Function

To train the parameters 𝑤 and 𝑏, we need to define a cost function.

Recap:
ŷ^(i) = σ(wᵀx^(i) + b), where σ(z^(i)) = 1 / (1 + e^(−z^(i))) and x^(i) is the i-th training example.

Given {(x^(1), y^(1)), … , (x^(m), y^(m))}, we want ŷ^(i) ≈ y^(i).

Loss (error) function:


The loss function measures the discrepancy between the prediction ŷ^(i) and the desired output y^(i). In other words, the loss function computes the error for a single training example.

A first candidate is the squared error, L(ŷ^(i), y^(i)) = ½ (ŷ^(i) − y^(i))², but it is not used in logistic regression because it makes the optimization problem non-convex. Instead, we use:

L(ŷ^(i), y^(i)) = −( y^(i) log(ŷ^(i)) + (1 − y^(i)) log(1 − ŷ^(i)) )

• If y^(i) = 1: L(ŷ^(i), y^(i)) = −log(ŷ^(i)); minimizing the loss pushes ŷ^(i) close to 1
• If y^(i) = 0: L(ŷ^(i), y^(i)) = −log(1 − ŷ^(i)); minimizing the loss pushes ŷ^(i) close to 0

Cost function
The cost function is the average of the loss function over the entire training set. We are going to find the parameters w and b that minimize the overall cost function.

J(w, b) = (1/m) Σ_(i=1)^m L(ŷ^(i), y^(i)) = −(1/m) Σ_(i=1)^m [ y^(i) log(ŷ^(i)) + (1 − y^(i)) log(1 − ŷ^(i)) ]
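A minimal NumPy sketch of computing this cost (A is the (1, m) array of predictions ŷ^(i), Y the (1, m) array of labels; names are illustrative):

    import numpy as np
    m = Y.shape[1]
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m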


One hidden layer
Neural Network

Neural Networks
deeplearning.ai
Overview
What is a Neural Network?

Logistic regression (a single unit):
x1, x2, x3, w, b   →   z = wᵀx + b   →   a = σ(z)   →   L(a, y)

One hidden layer network (the same computation stacked twice):
z[1] = W[1] x + b[1],    a[1] = σ(z[1])
z[2] = W[2] a[1] + b[2], a[2] = σ(z[2])   →   L(a[2], y)

Andrew Ng
One hidden layer
Neural Network

Neural Network
deeplearning.ai
Representation
Neural Network Representation

(Figure: a 2-layer neural network with inputs x1, x2, x3, one hidden layer, and output ŷ.)

Andrew Ng
One hidden layer
Neural Network

Computing a
deeplearning.ai Neural Network’s
Output
Neural Network Representation

For a single logistic unit with inputs x1, x2, x3:
z = wᵀx + b
a = σ(z) = ŷ
Each node of the network carries out these two steps of computation: first z, then a = σ(z).

Andrew Ng
Neural Network Representation

z = wᵀx + b
a = σ(z)
Each hidden unit computes its own z and a from the inputs x1, x2, x3, and the output unit does the same from the hidden activations.

Andrew Ng
Neural Network Representation

z1[1] = w1[1]ᵀ x + b1[1],   a1[1] = σ(z1[1])
z2[1] = w2[1]ᵀ x + b2[1],   a2[1] = σ(z2[1])
z3[1] = w3[1]ᵀ x + b3[1],   a3[1] = σ(z3[1])
z4[1] = w4[1]ᵀ x + b4[1],   a4[1] = σ(z4[1])

Andrew Ng
Neural Network Representation learning

Given input x:
z[1] = W[1] x + b[1]
a[1] = σ(z[1])
z[2] = W[2] a[1] + b[2]
a[2] = σ(z[2])

Andrew Ng
One hidden layer
Neural Network

Vectorizing across
deeplearning.ai
multiple examples
Vectorizing across multiple examples

For a single example x:
z[1] = W[1] x + b[1],    a[1] = σ(z[1])
z[2] = W[2] a[1] + b[2], a[2] = σ(z[2])

Andrew Ng
Vectorizing across multiple examples
for i = 1 to m:
    z[1](i) = W[1] x(i) + b[1]
    a[1](i) = σ(z[1](i))
    z[2](i) = W[2] a[1](i) + b[2]
    a[2](i) = σ(z[2](i))

Andrew Ng
One hidden layer
Neural Network

Explanation
deeplearning.ai for vectorized
implementation
Justification for vectorized implementation

Andrew Ng
Recap of vectorizing across multiple examples
for i = 1 to m:
    z[1](i) = W[1] x(i) + b[1]
    a[1](i) = σ(z[1](i))
    z[2](i) = W[2] a[1](i) + b[2]
    a[2](i) = σ(z[2](i))

X = [ x(1)  x(2)  …  x(m) ]     (each training example is a column)
Z[1] = W[1] X + b[1]            A[1] = σ(Z[1]) = [ a[1](1)  a[1](2)  …  a[1](m) ]
Z[2] = W[2] A[1] + b[2]         A[2] = σ(Z[2])
Andrew Ng
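A minimal NumPy sketch of this vectorized forward pass (W1, b1, W2, b2 and X are assumed to already exist with compatible shapes; sigmoid as defined earlier):

    Z1 = np.dot(W1, X) + b1     # shape (n1, m)
    A1 = sigmoid(Z1)
    Z2 = np.dot(W2, A1) + b2    # shape (1, m)
    A2 = sigmoid(Z2)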
One hidden layer
Neural Network

Activation functions
deeplearning.ai
Activation functions
!"

!# %&
!$

Given x:
'" =) " !++ "
," = -(' " )
'# =) # ," ++ #
,# = -(' # ) Andrew Ng
Pros and cons of activation functions
sigmoid: a = 1 / (1 + e^(−z))

(Plots of the candidate activation functions a(z).)

Andrew Ng
One hidden layer
Neural Network

Why do you
deeplearning.ai need non-linear
activation functions?
Activation function
%"

%. 01
%/

Given x:
" "
! =$ %+' "
( " = )["] (! " )
! . =$ . (" +' .
( . = )[.] (! . )
Andrew Ng
One hidden layer
Neural Network

Derivatives of
deeplearning.ai activation functions
Sigmoid activation function

g(z) = 1 / (1 + e^(−z))

Andrew Ng
Tanh activation function
g(z) = tanh(z)

Andrew Ng
ReLU and Leaky ReLU
(Plots of the two activation functions a(z).)
ReLU: a = max(0, z)        Leaky ReLU: a = max(0.01z, z)

Andrew Ng
One hidden layer
Neural Network

Gradient descent for


deeplearning.ai neural networks
Gradient descent for neural networks

Andrew Ng
Formulas for computing derivatives

Andrew Ng
One hidden layer
Neural Network

Backpropagation
deeplearning.ai intuition (Optional)
Computing gradients
Logistic regression
x, w, b   →   z = wᵀx + b   →   a = σ(z)   →   L(a, y)

Andrew Ng
Neural network gradients
& [$]
' )[$]
& ["] ! [#] = & [#] ' + ) [#] +[#] = ,(! [#] ) ! [0] = & [0] ' + ) [0] +[0] = ,(! [0] ) ℒ(+[0] , y)

)["]

Andrew Ng
Summary of gradient descent
dz[2] = a[2] − y
dW[2] = dz[2] a[1]ᵀ
db[2] = dz[2]
dz[1] = W[2]ᵀ dz[2] ∗ g[1]′(z[1])
dW[1] = dz[1] xᵀ
db[1] = dz[1]
Andrew Ng
Summary of gradient descent
!" [$] = '[$] − ) !6 ["] = 7["] − 8

, 1 ["] $ ,
!* [$] = [$]
!" ' + ["]
!* = !6 7
:
1
!- [$] = !" [$] !- = ;<. >?:(!6 " , '5A> = 1, BCC<!A:> = DE?C)
["]
:

!" [+] = * $.
!" [$] ∗ 0[+] ′(z + ) !6 [$] = * " % !6 ["] ∗ 0[$] ′(Z $ )

[+] [+] . 1
!* = !" 5 !* [$] = !6 [$] G %
:
1
!- [+] = !" [+] !-[$] = ;<. >?:(!6 $ , '5A> = 1, BCC<!A:> = DE?C)
:
Andrew Ng
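A minimal NumPy sketch of these backward-pass formulas, assuming the hidden activation g[1] is tanh (so g[1]′(z) = 1 − a²) and A1, A2, W2, X, Y, m already exist from the forward pass:

    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)   # tanh derivative
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m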
One hidden layer
Neural Network

Random Initialization
deeplearning.ai
What happens if you initialize weights to
zero?
(Figure: a network with two hidden units a1[1], a2[1], inputs x1, x2, and output ŷ.)
Andrew Ng
Random initialization
(Figure: the same network with two hidden units, inputs x1, x2, and output ŷ.)
Andrew Ng
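A minimal sketch of the random initialization discussed here (n0, n1, n2 are illustrative names for the layer sizes):

    W1 = np.random.randn(n1, n0) * 0.01   # small random values break symmetry
    b1 = np.zeros((n1, 1))                # biases can be initialized to zero
    W2 = np.random.randn(n2, n1) * 0.01
    b2 = np.zeros((n2, 1))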


Deep Neural
Networks

Deep L-layer
deeplearning.ai Neural network
What is a deep neural network?

logistic regression 1 hidden layer

2 hidden layers 5 hidden layers


Andrew
Ng
Deep neural network notation

Andrew
Ng
Deep Neural
Networks

Forward Propagation
deeplearning.ai in a Deep Network
Forward propagation in a deep network

Andrew
Ng
Deep Neural
Networks
Getting your matrix
deeplearning.ai
dimensions right
Parameters ! ["] and " ["]

&!
$%
&"

Andrew Ng
Vectorized implementation

#!
!"
#"

Andrew Ng
Deep Neural
Networks
Why deep
deeplearning.ai
representations?
Intuition about deep representation

!"

Andrew Ng
Circuit theory and deep learning
Informally: There are functions you can compute with a
“small” L-layer deep neural network that shallower networks
require exponentially more hidden units to compute.

Andrew Ng
Deep Neural
Networks

Building blocks of
deeplearning.ai deep neural networks
Forward and backward functions

Andrew
Ng
Forward and backward functions

Andrew
Ng
Deep Neural
Networks
Parameters vs
deeplearning.ai
Hyperparameters
What are hyperparameters?
Parameters: ! " , % " ,! & ,% & ,! ' ,% ' …

Andrew Ng
Applied deep learning is a very
empirical process
Idea

cost !

Experiment Code # of iterations

Andrew Ng
Deep Neural
Networks
What does this
deeplearning.ai have to do with
the brain?
Forward and backward propagation
! ["] = # ["] $ + & ["] -! [%] = '[%] − +
1
'["] = ( " (! " ) -# [%] = -! % ' %
&

0
! [$] = # [$] '["] + & [$] 1
'[$] = ( $ (! $ ) -& [%] = 12. sum(d! % , 9:;< = 1, =>>2-;0< = ?@A>)
0 & %
-! [%'"] = -# % -! % (( (! %'" )

'[%] = ( % ! % = +,


& "
-! ["] = -# % -! $ (( (! " )
1 &
-# = -! " ' "
["]
0
1
-& = 12. sum(d! " , 9:;< = 1, =>>2-;0< = ?@A>)
["]
0

Andrew Ng


Setting up your
ML application

Train/dev/test
deeplearning.ai sets
Applied ML is a highly iterative process

# layers Idea

# hidden units

learning rates

activation functions

… Experiment Code

Andrew Ng
Train/dev/test sets

Andrew Ng
Mismatched train/test distribution

Training set: Dev/test sets:


Cat pictures from Cat pictures from
webpages users using your app

Not having a test set might be okay. (Only dev set.)

Andrew Ng
Setting up your
ML application

Bias/Variance
deeplearning.ai
Bias and Variance

high bias “just right” high variance

Andrew Ng
Bias and Variance
Cat classification

Train set error:


Dev set error:

Andrew Ng
High bias and high variance
!#

!"

Andrew Ng
Setting up your
ML application
Basic “recipe”
deeplearning.ai for machine learning
Basic recipe for machine learning

Andrew Ng
Regularizing your
neural network

Regularization
deeplearning.ai
Logistic regression
min_(w,b) J(w, b)

Andrew Ng
Neural network

Andrew Ng
Regularizing your
neural network

Why regularization
deeplearning.ai reduces overfitting
How does regularization prevent overfitting?
!"
!# %&
!$

high bias “just right” high variance


Andrew Ng
How does regularization prevent overfitting?

Andrew Ng
Regularizing your
neural network

Dropout
deeplearning.ai regularization
Dropout regularization

!" !"

!# !#
$% $%
!' !'

!& !&

Andrew Ng
Implementing dropout (“Inverted dropout”)

Andrew Ng
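The "inverted dropout" technique described in this video, as a minimal NumPy sketch for layer 3 (keep_prob and a3 are illustrative names):

    keep_prob = 0.8
    d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob   # dropout mask
    a3 = a3 * d3             # zero out the dropped units
    a3 = a3 / keep_prob      # scale up so the expected value of a3 is unchanged

At test time no units are dropped and no extra scaling is needed, because the division by keep_prob already kept the activations on the same scale.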
Making predictions at test time

Andrew Ng
Regularizing your
neural network

Understanding
deeplearning.ai dropout
Why does drop-out work?
Intuition: Can’t rely on any one feature, so have to
spread out weights.

!"

!# $%

!&

Andrew Ng
Regularizing your
neural network

Other regularization
deeplearning.ai methods
Data augmentation

4
Andrew Ng
Early stopping

# iterations

Andrew Ng
Setting up your
optimization problem

Normalizing inputs
deeplearning.ai
Normalizing training sets

!" !"
3

!"
!# !#

!# 5

Andrew Ng
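A minimal sketch of the two normalization steps (X is the (n_x, m) training matrix; the same mu and sigma2 must be reused to normalize the test set):

    mu = np.mean(X, axis=1, keepdims=True)
    X = X - mu                               # zero mean
    sigma2 = np.mean(X ** 2, axis=1, keepdims=True)
    X = X / np.sqrt(sigma2)                  # unit variance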
Why normalize inputs?

J(w, b) = (1/m) Σ_(i=1)^m L(ŷ^(i), y^(i))

Unnormalized:                 Normalized:
(Contour plots of the cost J over w and b: elongated contours when the inputs are unnormalized, roughly round contours when they are normalized.)

Andrew Ng


Setting up your
optimization problem

Vanishing/exploding
deeplearning.ai
gradients
Vanishing/exploding gradients
!"
$%
!# =

Andrew Ng
Single neuron example
!"
!#
!$ &'

!% ( = *(,)

Andrew Ng
Setting up your
optimization problem
Numerical approximation
deeplearning.ai of gradients
Checking your derivative computation

Andrew Ng
Checking your derivative computation
!

Andrew Ng
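The idea of this video is to approximate a derivative numerically with a two-sided difference; a minimal sketch (f and theta are illustrative names):

    eps = 1e-7
    approx_grad = (f(theta + eps) - f(theta - eps)) / (2 * eps)

The two-sided formula is preferred over the one-sided (f(theta + eps) − f(theta)) / eps because its approximation error shrinks quadratically in eps rather than linearly.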
Setting up your
optimization problem

Gradient Checking
deeplearning.ai
Gradient check for a neural network

Take ! " , $ ["] , … , ! ( , $ ( and reshape into a big vector ).

Take +! " , +$ ["] , … , +! ( , +$ ( and reshape into a big vector d).

Andrew Ng
Gradient checking (Grad check)

Andrew Ng
Setting up your
optimization problem
Gradient Checking
deeplearning.ai implementation notes
Gradient checking implementation notes

- Don’t use in training – only to debug

- If algorithm fails grad check, look at components to try to identify bug.

- Remember regularization.

- Doesn’t work with dropout.

- Run at random initialization; perhaps again after some training.


Andrew Ng


Optimization
Algorithms

Mini-batch
deeplearning.ai
gradient descent
Batch vs. mini-batch gradient descent
Vectorization allows you to efficiently compute on m examples.

Andrew Ng
Mini-batch gradient descent

Andrew Ng
Optimization
Algorithms

Understanding
deeplearning.ai mini-batch
gradient descent
Training with mini batch gradient descent

Batch gradient descent Mini-batch gradient descent


cost

cost
# iterations mini batch # (t)

Andrew Ng
Choosing your mini-batch size

Andrew Ng
Choosing your mini-batch size

Andrew Ng
Andrew Ng
Andrew Ng
Optimization
Algorithms

Understanding
deeplearning.ai exponentially
weighted averages
Exponentially weighted averages
!" = $!"%& + (1 − $),"
temperature

days
Andrew Ng
Exponentially weighted averages
v_t = β v_(t−1) + (1 − β) θ_t

v_100 = 0.9 v_99 + 0.1 θ_100
v_99  = 0.9 v_98 + 0.1 θ_99
v_98  = 0.9 v_97 + 0.1 θ_98
Andrew Ng
Implementing exponentially weighted
averages
!" = 0
!% = &!" + (1 − &) -%
!/ = &!% + (1 − &) -/
!0 = &!/ + (1 − &) -0

Andrew Ng
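In practice this is implemented by overwriting a single variable v in a loop; a minimal sketch (thetas is an illustrative list of observations):

    v = 0
    beta = 0.9
    for theta in thetas:
        v = beta * v + (1 - beta) * theta   # v roughly averages the last 1/(1-beta) values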
Optimization
Algorithms

Bias correction
deeplearning.ai in exponentially
weighted average
Bias correction

(Plot: the exponentially weighted average starts too low in the first few days because v_0 = 0.)

v_t = β v_(t−1) + (1 − β) θ_t
Bias-corrected estimate: v_t / (1 − β^t), which matters mainly for small t.

Andrew Ng
Optimization
Algorithms

Gradient descent
deeplearning.ai
with momentum
Gradient descent example

Andrew Ng
Implementation details
On iteration t:
    Compute dW, db on the current mini-batch
    v_dW = β v_dW + (1 − β) dW
    v_db = β v_db + (1 − β) db
    W = W − α v_dW,   b = b − α v_db

Hyperparameters: α, β.  A common default is β = 0.9.
Andrew Ng
Optimization
Algorithms

RMSprop
deeplearning.ai
RMSprop

Andrew Ng
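The RMSprop update presented in this video is not in the extracted text; the standard form, reconstructed here (ε is a small constant added for numerical stability):

    On iteration t, using the current mini-batch:
    s_dW = β2 · s_dW + (1 − β2) · dW²        (element-wise square)
    s_db = β2 · s_db + (1 − β2) · db²
    W = W − α · dW / (sqrt(s_dW) + ε)
    b = b − α · db / (sqrt(s_db) + ε)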
Optimization
Algorithms

Adam optimization
deeplearning.ai
algorithm
Adam optimization algorithm


Andrew Ng
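Adam combines momentum and RMSprop with bias correction; a sketch of the standard update (hyperparameter names follow common convention and are reconstructed here, not taken from the extracted text):

    v_dW = β1 v_dW + (1 − β1) dW           s_dW = β2 s_dW + (1 − β2) dW²
    v_corrected = v_dW / (1 − β1^t)        s_corrected = s_dW / (1 − β2^t)
    W = W − α · v_corrected / (sqrt(s_corrected) + ε)
    (and analogously for b)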
Hyperparameters choice:

Adam Coates
Andrew Ng
Optimization
Algorithms

Learning rate
deeplearning.ai
decay
Learning rate decay

Andrew Ng
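The decay rule presented in this video, reconstructed here since it is missing from the extracted text:

    α = α_0 / (1 + decay_rate × epoch_num)

For example, with α_0 = 0.2 and decay_rate = 1 (illustrative values), α becomes 0.1 after the 1st epoch, about 0.067 after the 2nd, and 0.05 after the 3rd.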
Learning rate decay

Andrew Ng
Other learning rate decay methods

Andrew Ng
Optimization
Algorithms

The problem of
deeplearning.ai
local optima
Local optima in neural networks

Andrew Ng
Problem of plateaus

• Unlikely to get stuck in a bad local optimum


• Plateaus can make learning slow

Andrew Ng


Hyperparameter
tuning

Tuning process
deeplearning.ai
Hyperparameters

Andrew Ng
Try random values: Don’t use a grid

(Figures: sampling two hyperparameters on a regular grid vs. sampling them at random.)
Andrew Ng
Coarse to fine
Hyperparameter 2

Hyperparameter 1

Andrew Ng
Hyperparameter
tuning

Using an appropriate
deeplearning.ai scale to pick
hyperparameters
Picking hyperparameters at random

Andrew Ng
Appropriate scale for hyperparameters

Andrew Ng
Hyperparameters for exponentially
weighted averages

Andrew Ng
Hyperparameters
tuning

Hyperparameters
deeplearning.ai tuning in practice:
Pandas vs. Caviar
Re-test hyperparameters occasionally

Idea
- NLP, Vision, Speech,
Ads, logistics, ….

- Intuitions do get stale.


Re-evaluate occasionally.
Experiment Code

Andrew Ng
Babysitting one Training many
model models in parallel

Panda Caviar Andrew Ng


Batch
Normalization

Normalizing activations
deeplearning.ai in a network
Normalizing inputs to speed up learning
!" ', )
!# %&
!$

!"
!# %&
!$
Andrew Ng
Implementing Batch Norm

Andrew Ng
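The Batch Norm computation from this video, applied to the values z(1), …, z(m) of one layer on a mini-batch (reconstructed; ε is a small constant for numerical stability):

    μ = (1/m) Σ_i z(i)
    σ² = (1/m) Σ_i (z(i) − μ)²
    z_norm(i) = (z(i) − μ) / sqrt(σ² + ε)
    z̃(i) = γ · z_norm(i) + β        (γ and β are learnable parameters of the layer)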
Batch
Normalization

Fitting Batch Norm


deeplearning.ai into a neural network
Adding Batch Norm to a network
!"
!# %&
!$

Andrew Ng
Working with mini-batches

Andrew Ng
Implementing gradient descent

Andrew Ng
Batch
Normalization

Why does
deeplearning.ai
Batch Norm work?
Learning on shifting input distribution
!"
!# %&
!$
Cat Non-Cat
% = 1 % =0 % = 1 % =0

Andrew Ng
Why is this a problem for neural networks?

!"
!# %&
!$

Andrew Ng
Batch Norm as regularization
• Each mini-batch is scaled by the mean/variance computed
on just that mini-batch.
• This adds some noise to the values + [-] within that
minibatch. So similar to dropout, it adds some noise to each
hidden layer’s activations.
• This has a slight regularization effect.

Andrew Ng
Multi-class
classification

Softmax regression
deeplearning.ai
Recognizing cats, dogs, and baby chicks

(Figure: example images with class labels 0–3; with 4 classes, the network maps X to ŷ, a vector of 4 class probabilities.)

X → ŷ

Andrew Ng
Softmax layer

X !"

Andrew Ng
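The softmax activation used in the output layer, as a minimal sketch (z is the (C, 1) vector of the final layer's linear outputs):

    t = np.exp(z)          # element-wise exponential
    a = t / np.sum(t)      # each entry is a probability; the entries sum to 1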
Softmax examples

(Figures: example decision boundaries learned by a softmax classifier for different numbers of classes.)

Andrew Ng
Programming
Frameworks

Deep Learning
deeplearning.ai
frameworks
Deep learning frameworks
• Caffe/Caffe2 Choosing deep learning frameworks
• CNTK - Ease of programming (development
• DL4J and deployment)
• Keras - Running speed
• Lasagne - Truly open (open source with good
governance)
• mxnet
• PaddlePaddle
• TensorFlow
• Theano
• Torch

Andrew Ng
Programming
Frameworks

TensorFlow
deeplearning.ai
Motivating problem

Andrew Ng


Introduction to
ML strategy

Why ML
deeplearning.ai
Strategy?
Motivating example

Ideas:
• Collect more data • Try dropout
• Collect more diverse training set • Add !" regularization
• Train algorithm longer with gradient descent • Network architecture
• Try Adam instead of gradient descent • Activation functions
• Try bigger network • # hidden units
• Try smaller network • … Andrew Ng
Introduction to
ML strategy

Orthogonalization
deeplearning.ai
TV tuning example
Car

Andrew Ng
Chain of assumptions in ML

Fit training set well on cost function

Fit dev set well on cost function

Fit test set well on cost function

Performs well in real world

Andrew Ng
Setting up
your goal

Single number
deeplearning.ai
evaluation metric
Using a single number evaluation metric

Idea

Classifier Precision Recall F1 Score


A 95% 90% 92.4%
B 98% 85% 91.0%
Experiment Code

Andrew Ng
Another example

Algorithm US China India Other Average


A 3% 7% 5% 9% 6%
B 5% 6% 5% 10% 6.5%
C 2% 3% 4% 5% 3.5%
D 5% 8% 7% 2% 5.25%
E 4% 5% 2% 4% 3.75%
F 7% 11% 8% 12% 9.5%

Andrew Ng
Setting up
your goal

Satisficing and
deeplearning.ai
optimizing metrics
Another cat classification example
Classifier Accuracy Running time
A 90% 80ms
B 92% 95ms
C 95% 1,500ms

Andrew Ng
Setting up
your goal

Train/dev/test
deeplearning.ai
distributions
Cat classification dev/test sets
Regions:
• US
• UK
• Other Europe
• South America
• India
Idea
• China
• Other Asia
• Australia

Experiment Code
Andrew Ng
True story (details changed)

Optimizing on dev set on loan approvals for


medium income zip codes

Tested on low income zip codes

Andrew Ng
Guideline

Choose a dev set and test set to reflect data you


expect to get in the future and consider important
to do well on.

Andrew Ng
Setting up
your goal

Size of dev
deeplearning.ai
and test sets
Old way of splitting data

Andrew Ng
Size of dev set
Set your dev set to be big enough to detect differences in
algorithm/models you’re trying out.

Andrew Ng
Size of test set
Set your test set to be big enough to give high confidence
in the overall performance of your system.

Andrew Ng
Setting up
your goal

When to change
deeplearning.ai dev/test sets and
metrics
Cat dataset examples

Metric: classification error


Algorithm A: 3% error
Algorithm B: 5% error

Andrew Ng
Orthogonalization for cat pictures: anti-porn

1. So far we’ve only discussed how to define a metric to


evaluate classifiers.

2. Worry separately about how to do well on this metric.

Andrew Ng
Another example
Algorithm A: 3% error
Algorithm B: 5% error
Dev/test User images

If doing well on your metric + dev/test set does not


correspond to doing well on your application, change your
metric and/or dev/test set.

Andrew Ng
Comparing to human-
level performance

Why human-level
deeplearning.ai
performance?
Comparing to human-level performance

accuracy

time

Andrew Ng
Why compare to human-level performance
Humans are quite good at a lot of tasks. So long as
ML is worse than humans, you can:
- Get labeled data from humans.

- Gain insight from manual error analysis:


Why did a person get this right?

- Better analysis of bias/variance.

Andrew Ng
Comparing to human-
level performance

Avoidable bias
deeplearning.ai
Bias and Variance

high bias “just right” high variance

Andrew Ng
Bias and Variance
Cat classification

Training set error: 1% 15% 15% 0.5%


Dev set error: 11% 16% 30% 1%

Andrew Ng
Cat classification example

Training error 8% 8%
Dev error 10% 10 %

Andrew Ng
Comparing to human-
level performance
Understanding
deeplearning.ai human-level
performance
Human-level error as a proxy for Bayes error
Medical image classification example:
Suppose:
(a) Typical human ………………. 3 % error

(b) Typical doctor ………………... 1 % error

(c) Experienced doctor …………... 0.7 % error

(d) Team of experienced doctors .. 0.5 % error

What is “human-level” error?


Andrew Ng
Error analysis example

Training error

Dev error

Andrew Ng
Summary of bias/variance with human-level
performance

Human-level error

Training error

Dev error

Andrew Ng
Comparing to human-
level performance

Surpassing human-
deeplearning.ai
level performance
Surpassing human-level performance

Team of humans

One human

Training error

Dev error

Andrew Ng
Problems where ML significantly surpasses
human-level performance

- Online advertising

- Product recommendations

- Logistics (predicting transit time)

- Loan approvals

Andrew Ng
Comparing to human-
level performance

Improving your model


deeplearning.ai
performance
The two fundamental assumptions of
supervised learning

1. You can fit the training set pretty well.

2. The training set performance generalizes pretty


well to the dev/test set.

Andrew Ng
Reducing (avoidable) bias and variance

Human-level Train bigger model


Train longer/better optimization algorithms

Training error NN architecture/hyperparameters search

More data
Dev error Regularization

NN architecture/hyperparameters search
Andrew Ng
Orthogonalization

Orthogonalization or orthogonality is a system design property that assures that modifying an instruction
or a component of an algorithm will not create or propagate side effects to other components of the
system. It becomes easier to verify the algorithms independently from one another, it reduces testing and
development time.

When a supervised learning system is designed, these are the 4 assumptions that need to be true and
orthogonal.

1. Fit training set well in cost function


- If it doesn’t fit well, the use of a bigger neural network or switching to a better optimization
algorithm might help.
2. Fit development set well on cost function
- If it doesn’t fit well, regularization or using bigger training set might help.
3. Fit test set well on cost function
- If it doesn’t fit well, the use of a bigger development set might help
4. Performs well in real world
- If it doesn’t perform well, the development test set is not set correctly or the cost function is
not evaluating the right thing.
Single number evaluation metric
To choose a classifier, a well-defined development set and an evaluation metric speed up the iteration
process.

Example : Cat vs Non- cat


y = 1: a cat image is detected

                         Actual class y
Predicted class ŷ        1                   0
        1                True positive       False positive
        0                False negative      True negative
Precision
Of all the images we predicted y = 1, what fraction of them have cats?

Precision (%) = True positives / Number of predicted positives × 100 = True positives / (True positives + False positives) × 100

Recall

Of all the images that actually have cats, what fraction did we correctly identify as having cats?

Recall (%) = True positives / Number of actual positives × 100 = True positives / (True positives + False negatives) × 100

Let’s compare 2 classifiers A and B used to evaluate if there are cat images:

Classifier Precision (p) Recall (r)


A 95% 90%
B 98% 85%

In this case the evaluation metrics are precision and recall.

For classifier A, there is a 95% chance that there is a cat in the image and a 90% chance that it has correctly
detected a cat. Whereas for classifier B there is a 98% chance that there is a cat in the image and a 85%
chance that it has correctly detected a cat.

The problem with using precision/recall as the evaluation metric is that you are not sure which classifier is better, since in this case both have good precision and recall. The F1 score, a harmonic mean, combines precision and recall:

F1 score = 2 / (1/p + 1/r)

Classifier Precision (p) Recall (r) F1-Score


A 95% 90% 92.4 %
B 98% 85% 91.0%

Classifier A is the better choice. The F1 score is not the only evaluation metric that can be used; the average, for example, could also be an indicator of which classifier to use.
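As a quick check of the table above, the F1 score for classifier A works out to

F1(A) = 2 / (1/0.95 + 1/0.90) ≈ 0.924 (92.4%),   and   F1(B) = 2 / (1/0.98 + 1/0.85) ≈ 0.910 (91.0%).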
Satisficing and optimizing metric

There are different metrics to evaluate the performance of a classifier; they are called evaluation metrics. They can be categorized as satisficing and optimizing metrics. It is important to note that these evaluation metrics must be evaluated on a training set, a development set or on the test set.

Example: Cat vs Non-cat

Classifier Accuracy Running time


A 90% 80 ms
B 92% 95 ms
C 95% 1 500 ms

In this case, accuracy and running time are the evaluation metrics. Accuracy is the optimizing metric, because you want the classifier to correctly detect a cat image as accurately as possible. The running time, which is set to be under 100 ms in this example, is the satisficing metric, which means the metric just has to meet the expectation set.

The general rule is: with N metrics, pick 1 optimizing metric and treat the remaining N − 1 as satisficing metrics.
Training, development and test distributions

Setting up the training, development and test sets have a huge impact on productivity. It is important to
choose the development and test sets from the same distribution and it must be taken randomly from all
the data.

Guideline

Choose a development set and test set to reflect data you expect to get in the future and consider
important to do well.
Size of the development and test sets
Old way of splitting data
We used to have smaller data sets, therefore we had to use a greater percentage of the data to develop and test ideas and models.

70 % 30 %

Training set Test set


Or

60 % 20 % 20 %

Training set Development set Test set

Modern era – Big data


Now, because a large amount of data is available, we don't have to compromise as much and can use a greater portion to train the model.

98 %                 1 %                   1 %

Training set         Development set       Test set

Guidelines
• Set up the size of the test set to give a high confidence in the overall performance of the system.
• The test set helps evaluate the performance of the final classifier; it could be less than 30% of the whole data set.
• The development set has to be big enough to evaluate different ideas.
When to change development/test sets and metrics
Example: Cat vs Non-cat
A cat classifier tries to find a great amount of cat images to show to cat loving users. The evaluation metric
used is a classification error.

Algorithm Classification error [%]


A 3%
B 5%

It seems that Algorithm A is better than Algorithm B since there is only a 3% error, however for some reason,
Algorithm A is letting through a lot of the pornographic images.

Algorithm B has 5% error thus it classifies fewer images but it doesn't have pornographic images. From a
company's point of view, as well as from a user acceptance point of view, Algorithm B is actually a better
algorithm. The evaluation metric fails to correctly rank order preferences between algorithms. The evaluation
metric or the development set or test set should be changed.

The misclassification error metric can be written as a function as follows:

Error = (1 / m_dev) Σ_(i=1)^(m_dev) ℒ{ ŷ^(i) ≠ y^(i) }

This function counts up the number of misclassified examples.

The problem with this evaluation metric is that it treats pornographic vs non-pornographic images equally. One way to change this evaluation metric is to add a weight term w^(i):

w^(i) = 1  if x^(i) is non-pornographic
w^(i) = 10 if x^(i) is pornographic

The function becomes:

Error = (1 / Σ_i w^(i)) Σ_(i=1)^(m_dev) w^(i) ℒ{ ŷ^(i) ≠ y^(i) }

Guideline

1. Define correctly an evaluation metric that helps better rank order classifiers
2. Optimize the evaluation metric
Why human-level performance?

Today, machine learning algorithms can compete with human-level performance since they are more productive and more feasible in a lot of applications. Also, the workflow of designing and building a machine learning system is much more efficient than before.

Moreover, some of the tasks that humans do are close to ‘’perfection’’, which is why machine learning
tries to mimic human-level performance.

The graph below shows the performance of humans and machine learning over time.

(Graph: machine learning performance rises quickly, surpasses human-level performance, then flattens out as it approaches the Bayes optimal error.)

Machine learning progresses slowly once it surpasses human-level performance. One of the reasons is that human-level performance can be close to the Bayes optimal error, especially for natural perception problems.

Bayes optimal error is defined as the best possible error. In other words, it means that any functions
mapping from x to y can’t surpass a certain level of accuracy.

Also, when the performance of machine learning is worse than the performance of humans, you can improve it with different tools. These tools are harder to use once it surpasses human-level performance.

These tools are:

- Get labeled data from humans


- Gain insight from manual error analysis: Why did a person get this right?
- Better analysis of bias/variance.
Avoidable bias
By knowing what the human-level performance is, it is possible to tell when a training set is performing
well or not.

Example: Cat vs Non-Cat

Classification error (%)


Scenario A Scenario B
Humans 1 7.5
Training error 8 8
Development error 10 10

In this case, the human-level error is used as a proxy for Bayes error, since humans are good at identifying images. You want to improve the performance on the training set, but you can't do better than the Bayes error; otherwise the training set is overfitting. By knowing the Bayes error, it is easier to focus on whether bias or variance avoidance tactics will improve the performance of the model.

Scenario A
There is a 7% gap between the performance of the training set and the human level error. It means that
the algorithm isn’t fitting well with the training set since the target is around 1%. To resolve the issue, we
use bias reduction technique such as training a bigger neural network or running the training set longer.

Scenario B
The training set is doing good since there is only a 0.5% difference with the human level error. The
difference between the training set and the human level error is called avoidable bias. The focus here is
to reduce the variance since the difference between the training error and the development error is 2%.
To resolve the issue, we use variance reduction technique such as regularization or have a bigger training
set.
Understanding human-level performance
Human-level error gives an estimate of Bayes error.

Example 1: Medical image classification


This is an example of a medical image classification in which the input is a radiology image and the output
is a diagnosis classification decision.

Classification error (%)


Typical human 3.0
Typical doctor 1.0
Experienced doctor 0.7
Team of experienced doctors 0.5

The definition of human-level error depends on the purpose of the analysis, in this case, by definition the
Bayes error is lower or equal to 0.5%.

Example 2: Error analysis


                                    Classification error (%)
                                    Scenario A       Scenario B       Scenario C
Human (proxy for Bayes error)       1 / 0.7 / 0.5    1 / 0.7 / 0.5    0.5
Training error                      5                1                0.7
Development error                   6                5                0.8
Scenario A
In this case, the choice of human-level performance doesn’t have an impact. The avoidable bias is between
4%-4.5% and the variance is 1%. Therefore, the focus should be on bias reduction technique.

Scenario B
In this case, the choice of human-level performance doesn’t have an impact. The avoidable bias is between
0%-0.5% and the variance is 4%. Therefore, the focus should be on variance reduction technique.

Scenario C
In this case, the estimate for Bayes error has to be 0.5% since you can’t go lower than the human-level
performance otherwise the training set is overfitting. Also, the avoidable bias is 0.2% and the variance is
0.1%. Therefore, the focus should be on bias reduction technique.

Summary of bias/variance with human-level performance


• Human - level error – proxy for Bayes error
• If the difference between human-level error and the training error is bigger than the difference
between the training error and the development error. The focus should be on bias reduction
technique
• If the difference between training error and the development error is bigger than the difference
between the human-level error and the training error. The focus should be on variance reduction
technique
Surpassing human-level performance
Example1: Classification task

Classification error (%)


Scenario A Scenario B
Team of humans 0.5 0.5
One human 1.0 1
Training error 0.6 0.3
Development error 0.8 0.4
Scenario A
In this case, the Bayes error is 0.5%, therefore the avoidable bias is 0.1% and the variance is 0.2%.

Scenario B
In this case, there is not enough information to know whether bias reduction or variance reduction has to be done on the algorithm. It doesn't mean that the model cannot be improved; it means that the conventional ways of deciding between bias reduction and variance reduction do not work in this case.

There are many problems where machine learning significantly surpasses human-level performance,
especially with structured data:

• Online advertising
• Product recommendations
• Logistics (predicting transit time)
• Loan approvals
Improving your model performance

The two fundamental assumptions of supervised learning

There are 2 fundamental assumptions of supervised learning. The first one is to have a low avoidable bias
which means that the training set fits well. The second one is to have a low or acceptable variance which
means that the training set performance generalizes well to the development set and test set.

If the difference between human-level error and the training error is bigger than the difference between
the training error and the development error, the focus should be on bias reduction technique which are
training a bigger model, training longer or change the neural networks architecture or try various
hyperparameters search.

If the difference between training error and the development error is bigger than the difference between
the human-level error and the training error, the focus should be on variance reduction technique which
are bigger data set, regularization or change the neural networks architecture or try various
hyperparameters search.

Summary

• Train bigger model


• Train longer, better optimization algorithms
• Neural Networks architecture/hyperparameters search

• More data
• Regularization
• Neural Networks architecture/hyperparameters search



Error Analysis

Carrying out error


deeplearning.ai
analysis
Look at dev examples to evaluate ideas

Should you try to make your cat classifier do better on dogs?


Error analysis:
• Get ~100 mislabeled dev set examples.
• Count up how many are dogs.

Andrew Ng
Evaluate multiple ideas in parallel
Ideas for cat detection:
• Fix pictures of dogs being recognized as cats
• Fix great cats (lions, panthers, etc..) being misrecognized
• Improve performance on blurry images
(Spreadsheet: one row per mislabeled dev set image, one column per error category — dogs, great cats, blurry — plus a comments column and a "% of total" row.)
Andrew Ng
Error Analysis

Cleaning up
deeplearning.ai Incorrectly labeled
data
Incorrectly labeled examples

y 1 0 1 1 0 1 1

DL algorithms are quite robust to random errors in the


training set.

Andrew Ng
Error analysis
Image        Dog    Great Cat   Blurry   Incorrectly labeled   Comments
…
98                                        ✓                    Labeler missed cat in background
99                  ✓
100                                       ✓                    Drawing of a cat; not a real cat.
% of total   8%     43%         61%       6%

Overall dev set error

Errors due incorrect labels

Errors due to other causes

Goal of dev set is to help you select between two classifiers A & B.
Andrew Ng
Correcting incorrect dev/test set examples

• Apply same process to your dev and test sets to


make sure they continue to come from the same
distribution
• Consider examining examples your algorithm got
right as well as ones it got wrong.
• Train and dev/test data may now come from
slightly different distributions.
Andrew Ng
Error Analysis

Build your first system


deeplearning.ai
quickly, then iterate
Speech recognition example

• Noisy background • Set up dev/test set


• Café noise and metric
• Car noise • Build initial
• Accented speech system quickly
Guideline:
• Far from microphone • Use Bias/Variance
Build your
• Young children’s first
speech analysis & Error
system quickly,
• Stuttering analysis to
• … prioritize next
then iterate steps.
Andrew Ng
Mismatched training
and dev/test data
Training and testing
deeplearning.ai on different
distributions
Cat app example
Data from webpages Data from mobile app

Andrew Ng
Speech recognition example

Training Dev/test

Purchased data Speech activated


rearview mirror
Smart speaker control

Voice keyboard

Andrew Ng
Mismatched training
and dev/test data

Bias and Variance with


deeplearning.ai mismatched data
distributions
Cat classifier example
Assume humans get ≈ 0% error.

Training error
Dev error

Training-dev set: Same


distribution as training
set, but not used for
training
Andrew Ng
Bias/variance on mismatched training and
dev/test sets

Andrew Ng
More general formulation

Andrew Ng
Mismatched training
and dev/test data

Addressing data
deeplearning.ai
mismatch
Addressing data mismatch
• Carry out manual error analysis to try to understand difference
between training and dev/test sets

• Make training data more similar; or collect more data similar to


dev/test sets

Andrew Ng
Artificial data synthesis

+ =

“The quick brown Car noise Synthesized


fox jumps in-car audio
over the lazy dog.”

Andrew Ng
Artificial data synthesis

Car recognition:

Andrew Ng
Learning from
multiple tasks

Transfer learning
deeplearning.ai
Transfer learning

x !"

x !"

Andrew Ng
When transfer learning makes sense

• Task A and B have the same input x.

• You have a lot more data for Task A than Task B.

• Low level features from A could be helpful for learning B.

Andrew Ng
Learning from
multiple tasks

Multi-task
deeplearning.ai
learning
Simplified autonomous driving example

Andrew Ng
Neural network architecture

x !"

Andrew Ng
When multi-task learning makes sense
• Training on a set of tasks that could benefit from having
shared lower-level features.
• Usually: Amount of data you have for each task is quite
similar.

• Can train a big enough neural network to do well on all


the tasks.

Andrew Ng
End-to-end deep
learning
What is
deeplearning.ai end-to-end
deep learning
What is end-to-end learning?
Speech recognition example

Andrew Ng
Face recognition

[Image courtesy of Baidu]

Andrew Ng
More examples

Machine translation

Estimating child’s age:

Andrew Ng
End-to-end deep
learning

Whether to use
deeplearning.ai
end-to-end learning
Pros and cons of end-to-end deep learning
Pros:
• Let the data speak
• Less hand-designing of components needed

Cons:
• May need large amount of data
• Excludes potentially useful hand-designed
components
Andrew Ng
Applying end-to-end deep learning
Key question: Do you have sufficient data to learn
a function of the complexity needed to map x to y?

Andrew Ng
Build system quickly, then iterate
Depending on the area of application, the guideline below will help you prioritize when you build your
system.

Guideline
1. Set up development/ test set and metrics
- Set up a target
2. Build an initial system quickly
- Train training set quickly: Fit the parameters
- Development set: Tune the parameters
- Test set: Assess the performance
3. Use Bias/Variance analysis & Error analysis to prioritize next steps
Training and testing on different distributions
Example: Cat vs Non-cat
In this example, we want to create a mobile application that will classify and recognize pictures of cats
taken and uploaded by users.

There are two sources of data used to develop the mobile app. The first data distribution is small, 10 000
pictures uploaded from the mobile application. Since they are from amateur users, the pictures are not
professionally shot, not well framed and blurrier. The second source is from the web, you downloaded
200 000 pictures where cat’s pictures are professionally framed and in high resolution.

The problem is that you have a different distribution:

1- small data set from pictures uploaded by users. This distribution is important for the mobile app.
2- bigger data set from the web.

The guideline used is that you have to choose a development set and test set to reflect data you expect
to get in the future and consider important to do well.

The data is split as follows:

Training set: 205 000 images (the 200 000 web images plus 5 000 app images)
Development set and test set: the remaining 5 000 app images, split between them.

The advantage of this way of splitting up is that the target is well defined.

The disadvantage is that the training distribution is different from the development and test set distributions. However, this way of splitting the data gives better performance in the long term.
Bias and variance with mismatched data distributions
Example: Cat classifier with mismatch data distribution
When the training set is from a different distribution than the development and test sets, the method to
analyze bias and variance changes.

Classification error (%)


Scenario A Scenario B Scenario C Scenario D Scenario E Scenario F
Human (proxy for Bayes error) 0 0 0 0 0 4
Training error 1 1 1 10 10 7
Training-development error - 9 1.5 11 11 10
Development error 10 10 10 12 20 6
Test error - - - - - 6

Scenario A
If the development data came from the same distribution as the training set, we would conclude that there is a large variance problem and that the algorithm is not generalizing well from the training set.
However, since the training data and the development data come from different distributions, this conclusion cannot be drawn. There isn't necessarily a variance problem; the development set might simply contain images that are more difficult to classify accurately.
When the training, development and test set distributions differ, two things change at the same time. First, the algorithm was trained on the training set but not on the development set. Second, the distribution of the data in the development set is different.

It's difficult to know which of these two changes produces the 9% increase in error between the training set and the development set. To resolve this issue, we define a new subset called the training-development set. This new subset has the same distribution as the training set, but it is not used for training the neural network.

Scenario B

The gap between the training error and the training-development error is 8%. Since the training set and the training-development set come from the same distribution, the only difference between them is that the network was trained on the former and never saw the latter. The neural network is therefore not generalizing well to unseen data from the same distribution, so this really is a variance problem.

Scenario C
In this case, we have a data mismatch problem: the error barely increases from the training set (1%) to the training-development set (1.5%), which share a distribution, but jumps on the development set (10%), which comes from a different distribution.

Scenario D
In this case, the avoidable bias is high since the difference between Bayes error and training error is 10 %.

Scenario E
In this case, there are two problems: the avoidable bias is high, since the difference between Bayes error and training error is 10%, and there is also a data mismatch problem.

Scenario F
Development should never be done on the test set. However, the difference between the development
set and the test set gives the degree of overfitting to the development set.
General formulation

Bayes error
        ↕ avoidable bias
Training set error
        ↕ variance
Training-development set error
        ↕ data mismatch
Development set error
        ↕ degree of overfitting to the development set
Test set error


Addressing data mismatch

This is a general guideline to address data mismatch:

• Perform manual error analysis to understand the error differences between training,
development/test sets. Development should never be done on test set to avoid overfitting.

•	Make the training data more similar to the development and test sets, or collect more data similar to them. One way to make the training data more similar to your development set is artificial data synthesis. However, be aware that you might accidentally be simulating data from only a tiny subset of the space of all possible examples.
Transfer Learning
Transfer learning refers to reusing the knowledge a neural network has learned on one task for another application.

When to use transfer learning


• Task A and B have the same input 𝑥
• A lot more data for Task A than Task B
• Low level features from Task A could be helpful for Task B

Example 1: Cat recognition - radiology diagnosis


The following neural network is trained for cat recognition, but we want to adapt it for radiology diagnosis.
The neural network will learn about the structure and the nature of images. This initial phase of training
on image recognition is called pre-training, since it will pre-initialize the weights of the neural network.
Updating all the weights afterwards is called fine-tuning.

For cat recognition


Input 𝑥: image
Output 𝑦 – 1: cat, 0: no cat

Images Image recognition - cat

Radiology diagnosis
Input 𝑥: Radiology images – CT Scan, X-rays
Output 𝑦 :Radiology diagnosis – 1: tumor malign, 0: tumor benign

Radiology images
𝑥

Radiology diagnosis
𝑦ො
Guideline
• Delete last layer of neural network
• Delete weights feeding into the last output layer of the neural network
• Create a new set of randomly initialized weights for the last layer only
• New data set (𝑥, 𝑦)
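A minimal Keras-style sketch of the guideline above (not from the course; the MobileNetV2 base, the binary radiology label and all hyperparameters are illustrative assumptions):

```python
import tensorflow as tf

# Pre-trained image network with its original output layer removed (include_top=False)
base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                         include_top=False, pooling="avg",
                                         weights="imagenet")
base.trainable = False                                   # freeze the pre-trained weights

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(1, activation="sigmoid")       # new, randomly initialized last layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(radiology_images, radiology_labels, epochs=5)   # small new data set (x, y)

# With more radiology data, unfreeze everything and fine-tune all weights with a small learning rate
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5), loss="binary_crossentropy")
```

With a small new data set you would typically retrain only the new last layer; with more data, unfreezing all the layers corresponds to fine-tuning the whole network.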
Multi-task learning
Multi-task learning refers to having one neural network perform several tasks simultaneously.

When to use multi-task learning


• Training on a set of tasks that could benefit from having shared lower-level features
• Usually: Amount of data you have for each task is quite similar
• Can train a big enough neural network to do well on all tasks

Example: Simplified autonomous vehicle


The vehicle has to detect several things simultaneously: pedestrians, cars, road signs, traffic lights, cyclists, etc. We could have trained four separate neural networks instead of training one network to do four tasks. However, in this case the system performs better when a single neural network is trained to do all four tasks than when four separate networks are trained, since some of the earlier features in the network can be shared between the different types of objects.

The input x^(i) is the image, and each image can carry multiple labels.

The output y^(i) has 4 entries, one for each of: pedestrians, cars, road signs (stop), traffic lights. For example:

y^(i) = [0, 1, 1, 0]^T        (no pedestrian, a car, a stop sign, no traffic light)

Stacking the training examples column-wise:

Y = [ y^(1)  y^(2)  …  y^(m) ],   so Y has shape (4, m) and each y^(i) has shape (4, 1).

Neural network architecture: a single network whose output layer has 4 units, one each for pedestrians, cars, road signs (stop) and traffic lights.

To train this neural network, the loss function is defined as follows:

(1/m) Σ_{i=1}^{m} Σ_{j=1}^{4} −( y_j^(i) log ŷ_j^(i) + (1 − y_j^(i)) log(1 − ŷ_j^(i)) )

Also, the cost can be computed so that it is not influenced by the fact that some entries are not labeled: the inner sum over j only includes the entries for which a label is present (see the sketch below). Example, with "?" marking unlabeled entries:

Y = [ 1  0  ?  ?
      0  1  ?  0
      0  1  ?  1
      ?  0  1  0 ]
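A minimal NumPy sketch of this masked multi-task cost (the function name, the use of NaN to mark "?" entries and the toy predictions are illustrative assumptions):

```python
import numpy as np

def multitask_loss(Y, Y_hat):
    """Multi-label cross-entropy that ignores unlabeled entries.
    Y     : (4, m) array of 0/1 labels, with np.nan where the entry is unlabeled ("?")
    Y_hat : (4, m) array of predicted probabilities in (0, 1)"""
    mask = ~np.isnan(Y)                        # True only where a label exists
    Y = np.where(mask, Y, 0.0)                 # placeholder values, masked out below
    eps = 1e-12
    losses = -(Y * np.log(Y_hat + eps) + (1 - Y) * np.log(1 - Y_hat + eps))
    m = Y.shape[1]
    return np.sum(losses[mask]) / m            # sum only over labeled entries, average over examples

# Toy usage with the matrix above ("?" entries become np.nan)
Y = np.array([[1, 0, np.nan, np.nan],
              [0, 1, np.nan, 0],
              [0, 1, np.nan, 1],
              [np.nan, 0, 1, 0]])
Y_hat = np.full_like(Y, 0.7)
print(multitask_loss(Y, Y_hat))
```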
What is end-to-end deep learning
End-to-end deep learning replaces a multi-stage processing or learning pipeline with a single neural network.

Example - Speech recognition model


The traditional way - small data set

Audio Extract Features Phonemes Words Transcript

The hybrid way - medium data set

Audio Phonemes Words Transcript

The End-to-End deep learning way – large data set

Audio Transcript

End-to-end deep learning cannot be used for every problem since it needs a lot of labeled data. It is used mainly for audio transcription, image captioning, image synthesis, machine translation, steering in self-driving cars, etc.
Whether to use end-to-end deep learning

Before applying end-to-end deep learning, you need to ask yourself the following question: Do you have enough data to learn a function of the complexity needed to map x to y?

Pro:
• Let the data speak
- With a pure machine learning approach, the neural network learns the mapping from x to y directly. It is able to capture whatever statistics are actually in the data, rather than being forced to reflect human preconceptions.

• Less hand-designing of components needed


- It simplifies the design workflow.

Cons:
• Large amount of labeled data
- It cannot be used for every problem as it needs a lot of labeled data.

• Excludes potentially useful hand-designed component


- Data and hand-designed components or features are the two main sources of knowledge for a learning algorithm. If the data set is small, a hand-designed system is a way to inject manual knowledge into the algorithm.
Copyright Notice
These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Convolutional
Neural Networks

Computer vision
deeplearning.ai
Computer Vision Problems
Image Classification Neural Style Transfer

Cat? (0/1)

64x64

Object detection

Andrew Ng
Deep Learning on large images

Cat? (0/1)

64x64

!"
!#
⋮ ⋮ ⋮ %&
!$
Andrew Ng
Convolutional
Neural Networks

Edge detection
deeplearning.ai
example
Computer Vision Problem

vertical edges

horizontal edges Andrew Ng


Vertical edge detection
3  0  1  2  7  4
1  5  8  9  3  1          1  0  -1           -5   -4    0    8
2  7  2  5  1  3     ∗    1  0  -1     =    -10   -2    2    3
0  1  3  1  7  8          1  0  -1            0   -2   -4   -7
4  2  1  6  2  8                              -3   -2   -3  -16
2  4  5  2  3  9

(6×6 image)  ∗  (3×3 vertical edge filter)  =  (4×4 output)

Andrew Ng
Vertical edge detection
10 10 10 0 0 0
10 10 10 0 0 0 0 30 30 0
1 0 -1
10 10 10 0 0 0 0 30 30 0
∗ 1 0 -1 =
10 10 10 0 0 0 0 30 30 0
1 0 -1
10 10 10 0 0 0 0 30 30 0
10 10 10 0 0 0


Andrew Ng
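A small NumPy sketch of the example above, using the convention from these slides (cross-correlation without flipping the filter, valid padding, stride 1); the function name is an assumption:

```python
import numpy as np

def conv2d(image, filt):
    """'Convolution' as used in deep learning: slide the filter over the image,
    take elementwise products and sum (no flipping, valid padding, stride 1)."""
    n, f = image.shape[0], filt.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * filt)
    return out

image = np.array([[10, 10, 10, 0, 0, 0]] * 6)            # left half bright, right half dark
vertical_filter = np.array([[1, 0, -1],
                            [1, 0, -1],
                            [1, 0, -1]])
print(conv2d(image, vertical_filter))                     # 4x4 output with a band of 30s
```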
Convolutional
Neural Networks

More edge
deeplearning.ai
detection
Vertical edge detection examples
10 10 10 0 0 0
10 10 10 0 0 0 0 30 30 0
1 0 -1
10 10 10 0 0 0 0 30 30 0
10 10 10 0 0 0
∗ 1 0 -1 =
0 30 30 0
1 0 -1
10 10 10 0 0 0 0 30 30 0
10 10 10 0 0 0

0 0 0 10 10 10
0 0 0 10 10 10 0 -30 -30 0
1 0 -1
0 0 0 10 10 10 0 -30 -30 0
0 0 0 10 10 10
∗ 1 0 -1 =
0 -30 -30 0
1 0 -1
0 0 0 10 10 10 0 -30 -30 0
0 0 0 10 10 10

Andrew Ng
Vertical and Horizontal Edge Detection
1 0 -1 1 1 1
1 0 -1 0 0 0
1 0 -1 -1 -1 -1
Vertical Horizontal
10 10 10 0 0 0
10 10 10 0 0 0 0 0 0 0
1 1 1
10 10 10 0 0 0 30 10 -10 -30
∗ 0 0 0 =
0 0 0 10 10 10 30 10 -10 -30
-1 -1 -1
0 0 0 10 10 10 0 0 0 0
0 0 0 10 10 10
Andrew Ng
Learning to detect edges
1 0 -1
1 0 -1
1 0 -1

3 0 1 2 7 4
1 5 8 9 3 1
#$ #% #&
2 7 2 5 1 3
#' #( #)
0 1 3 1 7 8
#* #+ #,
4 2 1 6 2 8
2 4 5 2 3 9
Andrew Ng
Convolutional
Neural Networks

Padding
deeplearning.ai
Padding

∗ =

Andrew Ng
Valid and Same convolutions

“Valid”:

“Same”: Pad so that output size is the same


as the input size.

Andrew Ng
Convolutional
Neural Networks

Strided
deeplearning.ai
convolutions
Strided convolution
2  3  7  4  6  2  9
6  6  9  8  7  4  3
3  4  8  3  8  9  7           3  4  4
7  8  3  6  6  3  4      ∗    1  0  2     =    (3×3 output)
4  2  1  8  3  4  6          -1  0  3
3  2  4  1  9  8  3
0  1  3  9  2  1  4

(7×7 image)  ∗  (3×3 filter), stride s = 2

Andrew Ng
Summary of convolutions

n × n image      f × f filter

padding p    stride s

Output size:   ⌊(n + 2p − f)/s + 1⌋  ×  ⌊(n + 2p − f)/s + 1⌋

Andrew Ng
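A tiny helper (an illustration, not part of the slides) that evaluates the output-size formula above:

```python
from math import floor

def conv_output_size(n, f, p=0, s=1):
    return floor((n + 2 * p - f) / s) + 1

print(conv_output_size(6, 3))            # valid convolution:            4
print(conv_output_size(6, 3, p=1))       # "same" convolution (p=(f-1)/2): 6
print(conv_output_size(7, 3, p=0, s=2))  # strided example above:          3
```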
Technical note on cross-correlation vs.
convolution
Convolution in math textbook:
2 3 7 4 6 2
6 6 9 8 7 4
3 4 5
3 4 8 3 8 9
∗ 1 0 2
7 8 3 6 6 3
-1 9 7
4 2 1 8 3 4
3 2 4 1 9 8

Andrew Ng
Convolutional
Neural Networks

Convolutions over
deeplearning.ai
volumes
Convolutions on RGB images

Andrew Ng
Convolutions on RGB image

∗ =

4x4

Andrew Ng
Multiple filters

∗ =

3x3x3 4x4

6x6x3 ∗ =

3x3x3
4x4

Andrew Ng
Convolutional
Neural Networks

One layer of a
deeplearning.ai
convolutional
network
Example of a layer


3x3x3

6x6x3

3x3x3

Andrew Ng
Number of parameters in one layer

If you have 10 filters that are 3 x 3 x 3


in one layer of a neural network, how
many parameters does that layer have?

Andrew Ng
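Worked answer to the question above: each 3 × 3 × 3 filter has 27 weights plus 1 bias, i.e. 28 parameters, so 10 filters give 280 parameters in total, independent of the input image size. A one-line check:

```python
num_filters, f, channels = 10, 3, 3
print(num_filters * (f * f * channels + 1))   # 280 parameters
```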
Summary of notation
If layer l is a convolution layer:
#
" = filter size Input:
$ # = padding Output:
#
% = stride
#
&' = number of filters
Each filter is:
Activations:
Weights:
bias:
Andrew Ng
Convolutional
Neural Networks

A simple convolution
deeplearning.ai
network example
Example ConvNet

Andrew Ng
Types of layer in a convolutional network:

- Convolution
- Pooling
- Fully connected

Andrew Ng
Convolutional
Neural Networks

Pooling layers
deeplearning.ai
Pooling layer: Max pooling

1 3 2 1
2 9 1 1
1 3 2 3
5 6 1 2

Andrew Ng
Pooling layer: Max pooling

1 3 2 1 3
2 9 1 1 5
1 3 2 3 2
8 3 5 1 0
5 6 1 2 9

Andrew Ng
Pooling layer: Average pooling

1 3 2 1
2 9 1 1
1 4 2 3
5 6 1 2

Andrew Ng
Summary of pooling
Hyperparameters:

f : filter size
s : stride
Max or average pooling

Andrew Ng
Convolutional
Neural Networks

Convolutional neural
deeplearning.ai
network example
Neural network example

Andrew Ng
608

3216

48120
10164
850
Convolutional
Neural Networks

Why convolutions?
deeplearning.ai
Why convolutions

Andrew Ng
Why convolutions
10 10 10 0 0 0
10 10 10 0 0 0 0 30 30 0
1 0 -1
10 10 10 0 0 0 0 30 30 0
10 10 10 0 0 0
∗ 1 0 -1 =
0 30 30 0
1 0 -1
10 10 10 0 0 0 0 30 30 0
10 10 10 0 0 0

Parameter sharing: A feature detector (such as a vertical


edge detector) that’s useful in one part of the image is probably
useful in another part of the image.

Sparsity of connections: In each layer, each output value


depends only on a small number of inputs.
Andrew Ng
Putting it together
Training set (% & , ( & ) … (% + ,( + ).

(1

+
&
Cost , = + - ℒ((1 . , ( . )
./&

Use gradient descent to optimize parameters to reduce ,

Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Case Studies

Why look at
deeplearning.ai
case studies?
Outline
Classic networks:
• LeNet-5
• AlexNet
• VGG

ResNet

Inception

Andrew Ng
Case Studies

Classic networks
deeplearning.ai
LeNet - 5
avg pool avg pool


"#
5×5 f=2 5×5 f=2 ⋮
s=1 s=2 s=1 s=2

32×32 ×1 28×28×6 14×14×6 10×10×16 5×5×16


120 84

[LeCun et al., 1998. Gradient-based learning applied to document recognition] Andrew Ng


AlexNet
MAX-POOL MAX-POOL

11 × 11 3×3 5×5 3×3


s=4 s=2 same s=2
55×55 ×96 27×27 ×96 27×27 ×256 13×13 ×256
227×227 ×3

MAX-POOL
= ⋮ ⋮ ⋮
3×3 3×3 3×3 3×3
s=2
Softmax
same
1000
13×13 ×384 13×13 ×384 13×13 ×256 6×6 ×256 9216 4096 4096

[Krizhevsky et al., 2012. ImageNet classification with deep convolutional neural networks] Andrew Ng
VGG - 16
CONV = 3×3 filter, s = 1, same MAX-POOL = 2×2 , s = 2

224×224×64 112×112 ×64 112×112 ×128 56×56 ×128


[CONV 64] POOL [CONV 128] POOL
×2 ×2

224×224 ×3

56×56 ×256 28×28 ×256 28×28 ×512 14×14×512


[CONV 256] POOL [CONV 512] POOL
×3 ×3

14×14 ×512 7×7×512 FC FC Softmax


[CONV 512] POOL 4096 4096 1000
×3

[Simonyan & Zisserman 2015. Very deep convolutional networks for large-scale image recognition] Andrew Ng
Case Studies

Residual Networks
deeplearning.ai
(ResNets)
Residual block
a^[l]  →  a^[l+1]  →  a^[l+2]

z^[l+1] = W^[l+1] a^[l] + b^[l+1]        a^[l+1] = g(z^[l+1])
z^[l+2] = W^[l+2] a^[l+1] + b^[l+2]      a^[l+2] = g(z^[l+2])

With the shortcut (skip connection), the residual block instead computes  a^[l+2] = g(z^[l+2] + a^[l]).

[He et al., 2015. Deep residual networks for image recognition] Andrew Ng
Residual Network

x ![#]

Plain ResNet
training error

training error
# layers # layers
[He et al., 2015. Deep residual networks for image recognition] Andrew Ng
Case Studies

Why ResNets
deeplearning.ai
work
Why do residual networks work?

Andrew Ng
ResNet
Plain

ResNet

[He et al., 2015. Deep residual networks for image recognition] Andrew Ng
Case Studies

Network in Network
deeplearning.ai
and 1×1 convolutions
What does a 1 × 1 convolution do?
1 2 3 6 5 8
3 5 5 1 3 4
2 1 3 4 9 3
4 7 8 5 7 9
∗ 2 =
1 5 3 7 4 8
5 4 9 8 3 5
6×6

∗ =

6 × 6 × 32 1 × 1 × 32 6 × 6 × # filters
[Lin et al., 2013. Network in network] Andrew Ng
Using 1×1 convolutions
ReLU

CONV 1 × 1
32
28 × 28 × 32
28 × 28 × 192

[Lin et al., 2013. Network in network] Andrew Ng


Case Studies

Inception network
deeplearning.ai
motivation
Motivation for inception network
1×1

3×3
64

128
5×5 28
32
32
28
28 × 28 × 192 MAX-POOL

[Szegedy et al. 2014. Going deeper with convolutions] Andrew Ng


The problem of computational cost

CONV
5 × 5,
same,
32 28 × 28 × 32
28 × 28 × 192

Andrew Ng
Using 1×1 convolution

CONV CONV
1 × 1, 5 × 5,
16, 32,
1 × 1 × 192 28 × 28 × 16 5 × 5 × 16 28 × 28 × 32
28 × 28 × 192

Andrew Ng
Case Studies

Inception network
deeplearning.ai
Inception module
1×1
CONV

1×1 3×3
CONV CONV
Previous Channel
Activation Concat
1×1 5×5
CONV CONV
MAXPOOL
3 × 3,s = 1
1×1
same CONV
Andrew Ng
Inception network

[Szegedy et al., 2014, Going Deeper with Convolutions] Andrew Ng


http://knowyourmeme.com/memes/we-need-to-go-deeper Andrew Ng
Convolutional
Neural Networks

MobileNet
Motivation for MobileNets
• Low computational cost at deployment
• Useful for mobile and embedded vision
applications
• Key idea: Normal vs. depthwise-
separable convolutions

[Howard et al. 2017, MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications] Andrew Ng
Normal Convolution

* =

3x3x3
4 4x x4 4x 5
6x6x3

Computational cost = #filter params x # filter positions x # of filters

Andrew Ng
Depthwise Separable Convolution
Normal Convolution

* =

Depthwise Separable Convolution

* * =

Depthwise Pointwise

Andrew Ng
Depthwise Convolution

* =

3x3 4x4x3
6x6x3

Computational cost = #filter params x # filter positions x # of filters

Andrew Ng
Depthwise Separable Convolution
Depthwise Convolution

* =

Pointwise Convolution

* =

Andrew Ng
Pointwise Convolution

* =

1x1x3

4x4x3 4 x4 4x x4 5

Computational cost = #filter params x # filter positions x # of filters

Andrew Ng
Depthwise Separable Convolution
Normal Convolution

* =

Depthwise Separable Convolution

* * =

Depthwise Pointwise

Andrew Ng
Cost Summary
Cost of normal convolution

Cost of depthwise separable convolution

[Howard et al. 2017, MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications] Andrew Ng
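Plugging in the dimensions used in these slides (6×6×3 input, 3×3 filters, 4×4 output, 5 output channels), a quick check of the two costs (a sketch counting multiplications only):

```python
# cost = #filter params x # filter positions x # of filters
f, n_c, n_out, n_filters = 3, 3, 4, 5

normal = (f * f * n_c) * (n_out * n_out) * n_filters        # 2160 multiplications
depthwise = (f * f) * (n_out * n_out) * n_c                  # 432
pointwise = (1 * 1 * n_c) * (n_out * n_out) * n_filters      # 240
print(normal, depthwise + pointwise)                         # 2160 vs 672
```

The depthwise separable convolution needs roughly a third of the multiplications here, and the saving grows with the number of filters.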
Depthwise Separable Convolution
Depthwise Convolution

* =

Pointwise Convolution

* =

Andrew Ng
Convolutional
Neural Networks

MobileNet
Architecture
MobileNet
MobileNet v1

MobileNet v2
Residual Connection

Expansion Depthwise Projection

[Sandler et al. 2019, MobileNetV2: Inverted Residuals and Linear Bottlenecks] Andrew Ng
MobileNet v2 Bottleneck
Residual Connection

Expansion Depthwise Pointwise

[Sandler et al. 2019, MobileNetV2: Inverted Residuals and Linear Bottlenecks] Andrew Ng
MobileNet
MobileNet v1

MobileNet v2
Residual Connection

Expansion Depthwise Projection

[Sandler et al. 2019, MobileNetV2: Inverted Residuals and Linear Bottlenecks] Andrew Ng
MobileNet v2 Full Architecture

conv2d avgpool conv2d


conv2d 1x1 7x7 1x1

[Sandler et al. 2019, MobileNetV2: Inverted Residuals and Linear Bottlenecks] Andrew Ng
Convolutional
Neural Networks

EfficientNet
EfficientNet
Baseline

𝑦ො

Wider
Higher
Deeper Resolution
Compound scaling

𝑦ො 𝑦ො 𝑦ො 𝑦ො

[Tan and Le, 2019, EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks] Andrew Ng
Practical advice for
using ConvNets

Transfer Learning
deeplearning.ai
Practical advice for
using ConvNets

Data augmentation
deeplearning.ai
Common augmentation method
Mirroring

Random Cropping Rotation


Shearing
Local warping

Andrew Ng
Color shifting

+20,-20,+20

-20,+20,+20

+5,0,+50

Andrew Ng
Implementing distortions during training

Andrew Ng
Practical advice for
using ConvNets

The state of
deeplearning.ai
computer vision
Data vs. hand-engineering

Two sources of knowledge


• Labeled data
• Hand engineered features/network architecture/other components
Andrew Ng
Tips for doing well on benchmarks/winning
competitions
Ensembling
• Train several networks independently and average their outputs

Multi-crop at test time


• Run classifier on multiple versions of test images and average results

Andrew Ng
Use open source code

• Use architectures of networks published in the literature

• Use open source implementations if possible

• Use pretrained models and fine-tune on your dataset

Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Object Detection

Object
deeplearning.ai
localization
What are localization and detection?
Image classification Classification with Detection
localization

Andrew Ng
Classification with localization

⋯ ⋮

1- pedestrian
2- car
3- motorcycle
4- background
Andrew Ng
Defining the target label y
1- pedestrian Need to output #$ , #& , #' , #( , class label (1-4)
2- car
3- motorcycle
4- background

Andrew Ng
Object Detection

Landmark
deeplearning.ai
detection
Landmark detection ConvNet

!" , !$ , !% , !&

Andrew Ng
Object Detection

Object
deeplearning.ai
detection
Car detection example
Training set:
x y
1

Andrew Ng
Sliding windows detection

Andrew Ng
Object Detection

Convolutional
deeplearning.ai implementation of
sliding windows
Turning FC layer into convolutional layers

MAX POOL FC FC

5×5 2×2 ⋮ ⋮
y
14 × 14 × 3 10 × 10 × 16 5 × 5 × 16 400 400 softmax (4)

MAX POOL FC FC

5×5 2×2 5×5 1×1

14 × 14 × 3 10 × 10 × 16 5 × 5 × 16 1 × 1× 400 1 × 1 × 400 1×1×4

Andrew Ng
Convolution implementation of sliding windows
MAX POOL FC FC FC

5×5 2×2 5×5 1×1 1×1

14×14 ×3 10×10×16 5×5×16 1×1×400 1×1×400 1×1×4

MAX POOL FC FC FC

5×5 2×2 5×5 1×1 1×1

16×16×3 12×12×16 6×6×16 2×2×400 2×2×400 2×2×4

MAX POOL

5×5 2×2 5×5 1×1 1×1

28×28×3 24×24×16 12×12×16 8×8×400 8×8×400 8×8×4


[Sermanet et al., 2014, OverFeat: Integrated recognition, localization and detection using convolutional networks] Andrew Ng
Convolution implementation of sliding windows

MAX POOL

5×5 2×2 5×5 1×1 1×1

28×28 16×16 12×12 8×8×400 8×8×400 8×8×4

Andrew Ng
Object Detection

Bounding box
deeplearning.ai
predictions
Output accurate bounding boxes

Andrew Ng
YOLO algorithm
Labels for training
For each grid cell:

100

100

[Redmon et al., 2015, You Only Look Once: Unified real-time object detection] Andrew Ng
Specify the bounding boxes

0.5
100 0.9

100

[Redmon et al., 2015, You Only Look Once: Unified real-time object detection] Andrew Ng
Object Detection

Intersection
deeplearning.ai
over union
Evaluating object localization

“Correct” if IoU ≥ 0.5

More generally, IoU is a measure of the overlap between two bounding boxes.
Andrew Ng
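A small sketch of the IoU computation (the (x1, y1, x2, y2) box convention is an assumption; the slide only defines IoU as the intersection over the union of two boxes):

```python
def iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes (x1, y1, x2, y2)."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 4, 4), (2, 2, 6, 6)))   # 4 / 28 ≈ 0.14, below the 0.5 "correct" threshold
```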
Object Detection

Non-max
deeplearning.ai
suppression
Non-max suppression example

Andrew Ng
Non-max suppression example

0.6
0.8

0.9
0.3
0.5

Andrew Ng
Non-max suppression example

0.6
0.8

0.9
0.7
0.7

Andrew Ng
Non-max suppression algorithm
Each output prediction is:  [ p_c, b_x, b_y, b_h, b_w ]

Discard all boxes with p_c ≤ 0.6
While there are any remaining boxes:
• Pick the box with the largest $%
Output that as a prediction.
19×19
• Discard any remaining box with
IoU ≥ 0.5 with the box output
in the previous step Andrew Ng
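A sketch of the algorithm above in plain Python (the box format and helper names are illustrative assumptions):

```python
def iou(a, b):
    # a, b: (x1, y1, x2, y2); same overlap measure as in the IoU sketch earlier
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union else 0.0

def non_max_suppression(boxes, score_threshold=0.6, iou_threshold=0.5):
    """Each box is (p_c, x1, y1, x2, y2)."""
    boxes = [b for b in boxes if b[0] > score_threshold]      # discard boxes with p_c <= 0.6
    boxes.sort(key=lambda b: b[0], reverse=True)
    kept = []
    while boxes:
        best = boxes.pop(0)                                   # pick the box with the largest p_c
        kept.append(best)                                     # output it as a prediction
        boxes = [b for b in boxes if iou(best[1:], b[1:]) < iou_threshold]
    return kept
```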
Object Detection

Anchor boxes
deeplearning.ai
Overlapping objects:
Anchor box 1: Anchor box 2:

!"
#$
#%
#&
y = #'
()
(*
(+
[Redmon et al., 2015, You Only Look Once: Unified real-time object detection] Andrew Ng
Anchor box algorithm
Previously: With two anchor boxes:
Each object in training Each object in training
image is assigned to grid image is assigned to grid
cell that contains that cell that contains object’s
object’s midpoint. midpoint and anchor box
for the grid cell with
highest IoU.

Andrew Ng
Anchor box example !"
#$
#%
#&
#'
()
(*
(+
y = !"
#$
#%
#&
#'
Anchor box 1: Anchor box 2:
()
(*
(+
Andrew Ng
Object Detection

Putting it together:
deeplearning.ai
YOLO algorithm
Training 1 - pedestrian
'( 0 0
2 - car )* ? ?
)+ ? ?
3 - motorcycle
)- ? ?
). ? ?
/0 ? ?
/1 ? ?
/2 ? ?
y = '( 0 1
)* ? )*
)+ ? )+
)- ? )-
). ?
/0
).
? 0
/1 ?
/2 1
? 0
y is 3×3×2×8

[Redmon et al., 2015, You Only Look Once: Unified real-time object detection] Andrew Ng
Making predictions
'(
)*
)+
)-
).
/0
⋯ 4= /1
/2
'(
)*
3×3×2×8 )+
)-
).
/0
/1
/2

Andrew Ng
Outputting the non-max supressed outputs

• For each grid call, get 2 predicted bounding


boxes.

• Get rid of low probability predictions.

• For each class (pedestrian, car, motorcycle)


use non-max suppression to generate final
predictions.

Andrew Ng
Object Detection

Region proposals
deeplearning.ai
(Optional)
Region proposal: R-CNN

[Girshik et. al, 2013, Rich feature hierarchies for accurate object detection and semantic segmentation] Andrew Ng
Faster algorithms

R-CNN: Propose regions. Classify proposed regions one at a


time. Output label + bounding box.

Fast R-CNN: Propose regions. Use convolution implementation


of sliding windows to classify all the proposed
regions.

Faster R-CNN: Use convolutional network to propose regions.

[Girshik et. al, 2013. Rich feature hierarchies for accurate object detection and semantic segmentation]
[Girshik, 2015. Fast R-CNN]
[Ren et. al, 2016. Faster R-CNN: Towards real-time object detection with region proposal networks] Andrew Ng
Convolutional
Neural Networks

Semantic segmentation
with U-Net
Object Detection vs. Semantic Segmentation

Input image Object Detection Semantic Segmentation

Andrew Ng
Motivation for U-Net

Chest X-Ray Brain MRI

[Novikov et al., 2017, Fully Convolutional Architectures for Multi-Class Segmentation in Chest Radiographs]
[Dong et al., 2017, Automatic Brain Tumor Detection and Segmentation Using U-Net Based Fully Convolutional Networks ] Andrew Ng
Per-pixel class labels
000000000000000000000000
000000000000000000000000
000000000000000000000000
000000000000000000000000
000000000000000000000000
000000000000000000000000
000000000000000000000000
000000000000000000000000
1. Car
000000011111100000000000 0. Not Car
001111111111111100000000
001111111111111111111110
001111111111111111111110
000011100000000000111000
000000000000000000000000
000000000000000000000000
000000000000000000000000
000000000000000000000000

Andrew Ng
Per-pixel class labels
222222222222222222222222 222222222222222222222222
222222222222222222222222 222222222222222222222222
222222222222222222222222 222222222222222222222222
222222222222222222222222 222222222222222222222222
222222222222222222222222 222222222222222222222222
222222222222222222222222 222222222222222222222222
222222222222222222222222 222222222222222222222222
222222222222222222222222 222222222222222222222222
2 2 2 2 2 2 21 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 1. Car 2 2 2 2 2 2 21 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 21 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 21 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
2 21 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2. Building 2 21 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
2 21 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3. Road 2 21 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
333311133333333333111 333 333311133333333333111 333
333333333333333333333333 333333333333333333333333
333333333333333333333333 333333333333333333333333
333333333333333333333333 333333333333333333333333
333333333333333333333333 333333333333333333333333
Segmentation Map

Andrew Ng
Deep Learning for Semantic Segmentation

𝑦ො

Andrew Ng
Transpose Convolution
Normal Convolution

* =

Transpose Convolution

* =

Andrew Ng
Transpose Convolution
(worked example: a 2×2 input is expanded to a 4×4 output; each input value multiplies the 3×3 weight filter, the result is written into the output at the corresponding stride-2 position, the padded border of the output is discarded, and overlapping contributions are summed)

2×2 input,  3×3 weight filter,  padding p = 1,  stride s = 2  →  4×4 output


Andrew Ng
Deep Learning for Semantic Segmentation
Skip connection

𝑦ො

Andrew Ng
U-Net

Conv, RELU
Max Pool
Trans Conv
Skip Connection
Conv (1x1)

[Ronneberger et al., 2015, U-Net: Convolutional Networks for Biomedical Image Segmentation] Andrew Ng
U-Net

hxwx3 h x w x # classes

Conv, RELU
Max Pool
Trans Conv
Skip Connection
Conv (1x1)

[Ronneberger et al., 2015, U-Net: Convolutional Networks for Biomedical Image Segmentation] Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Face recognition

What is face
deeplearning.ai
recognition?
Face recognition

[Courtesy of Baidu] Andrew Ng


Face verification vs. face recognition
Verification
• Input image, name/ID
• Output whether the input image is that of the
claimed person

Recognition
• Has a database of K persons
• Get an input image
• Output ID if the image is any of the K persons (or
“not recognized”)
Andrew Ng
Face recognition

One-shot learning
deeplearning.ai
One-shot learning
Learning from one
example to recognize the
person again

Andrew Ng
Learning a “similarity” function
d(img1,img2) = degree of difference between images
If d(img1, img2) ≤ τ, predict “same person”; if d(img1, img2) > τ, predict “different persons”.

Andrew Ng
Face recognition

Siamese network
deeplearning.ai
Siamese network

⋮ ⋮

" ($)

⋮ ⋮

" (&)

[Taigman et. al., 2014. DeepFace closing the gap to human level performance] Andrew Ng
Goal of learning

⋮ ⋮

f(" ($) )

Parameters of the NN define an encoding f(x^(i)).

Learn parameters so that:
If x^(i), x^(j) are the same person, ‖f(x^(i)) − f(x^(j))‖² is small.
If x^(i), x^(j) are different persons, ‖f(x^(i)) − f(x^(j))‖² is large.

Andrew Ng
Face recognition

Triplet loss
deeplearning.ai
Learning Objective

Anchor Positive Anchor Negative

[Schroff et al.,2015, FaceNet: A unified embedding for face recognition and clustering] Andrew Ng
Loss function

Training set: 10k pictures of 1k persons

[Schroff et al.,2015, FaceNet: A unified embedding for face recognition and clustering] Andrew Ng
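A minimal NumPy sketch of the triplet loss from Schroff et al. (cited above), L(A, P, N) = max(‖f(A) − f(P)‖² − ‖f(A) − f(N)‖² + α, 0); the margin value used here is an illustrative assumption:

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet loss on the embeddings of an anchor, a positive and a negative image."""
    pos_dist = np.sum((f_a - f_p) ** 2)     # ||f(A) - f(P)||^2, should be small
    neg_dist = np.sum((f_a - f_n) ** 2)     # ||f(A) - f(N)||^2, should be large
    return max(pos_dist - neg_dist + alpha, 0.0)
```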
Choosing the triplets A,P,N

During training, if A,P,N are chosen randomly,


! ", $ + & ≤ !(", )) is easily satisfied.

Choose triplets that’re “hard” to train on.

[Schroff et al.,2015, FaceNet: A unified embedding for face recognition and clustering] Andrew Ng
Training set using triplet loss
Anchor Positive Negative

⋮ ⋮ ⋮

Andrew Ng
Face recognition

Face verification and


deeplearning.ai
binary classification
Learning the similarity function

$ (%) f($ (%) ) ()

$ (') f($ (') )

[Taigman et. al., 2014. DeepFace closing the gap to human level performance] Andrew Ng
Face verification supervised learning
$ (

[Taigman et. al., 2014. DeepFace closing the gap to human level performance] Andrew Ng
Neural Style
Transfer

What is neural style


deeplearning.ai
transfer?
Neural style transfer

Content Style Content Style

Generated image Generated image


[Images generated by Justin Johnson] Andrew Ng
Neural Style
Transfer

What are deep


deeplearning.ai
ConvNets learning?
Visualizing what a deep network is learning

⋮ ⋮ &'
26×26×256 13×13×256 13×13×384 13×13×384 6×6×256
55×55×96
FC FC
224×224×3 110×110×96 4096 4096

Pick a unit in layer 1. Find the nine


image patches that maximize the unit’s
activation.
Repeat for other units.

[Zeiler and Fergus., 2013, Visualizing and understanding convolutional networks] Andrew Ng
Visualizing deep layers

Layer 1 Layer 2 Layer 3 Layer 4 Layer 5

Andrew Ng
Visualizing deep layers: Layer 1

Layer 1 Layer 2 Layer 3 Layer 4 Layer 5

Andrew Ng
Visualizing deep layers: Layer 2

Layer 1 Layer 2 Layer 3 Layer 4 Layer 5

Andrew Ng
Visualizing deep layers: Layer 3

Layer 1 Layer 2 Layer 3 Layer 4 Layer 5

Andrew Ng
Visualizing deep layers: Layer 3

Layer 1 Layer 2 Layer 3 Layer 4 Layer 5

Andrew Ng
Visualizing deep layers: Layer 4

Layer 1 Layer 2 Layer 3 Layer 4 Layer 5

Andrew Ng
Visualizing deep layers: Layer 5

Layer 1 Layer 2 Layer 3 Layer 4 Layer 5

Andrew Ng
Neural Style
Transfer

Cost function
deeplearning.ai
Neural style transfer cost function

Content C Style S

Generated image G
[Gatys et al., 2015. A neural algorithm of artistic style. Images on slide generated by Justin Johnson] Andrew Ng
Find the generated image G
1. Initialize G randomly
G: 100×100×3

2. Use gradient descent to minimize J(G)

[Gatys et al., 2015. A neural algorithm of artistic style] Andrew Ng


Neural Style
Transfer

Content cost
deeplearning.ai
function
Content cost function
J(G) = α J_content(C, G) + β J_style(S, G)
• Say you use hidden layer l to compute content cost.
• Use pre-trained ConvNet. (E.g., VGG network)
• Let a^[l](C) and a^[l](G) be the activations of layer l on the images.
• If a^[l](C) and a^[l](G) are similar, both images have similar content:
  J_content(C, G) = ½ ‖a^[l](C) − a^[l](G)‖²

[Gatys et al., 2015. A neural algorithm of artistic style] Andrew Ng


Neural Style
Transfer

Style cost function


deeplearning.ai
Meaning of the “style” of an image
Say you are using layer l’s activation to measure “style.”


Define style as correlation between activations across channels.

How correlated are the activations across the different channels?

[Gatys et al., 2015. A neural algorithm of artistic style] Andrew Ng


Intuition about style of an image
Style image Generated Image

%' %'
%( %(
%& %&

[Gatys et al., 2015. A neural algorithm of artistic style] Andrew Ng


Style matrix
Let a_{i,j,k}^[l] = the activation at position (i, j, k). The style matrix G^[l] is n_c^[l] × n_c^[l]:

G_{kk′}^[l] = Σ_{i=1}^{n_H^[l]} Σ_{j=1}^{n_W^[l]} a_{i,j,k}^[l] · a_{i,j,k′}^[l]

[Gatys et al., 2015. A neural algorithm of artistic style] Andrew Ng


Style cost function
J_style^[l](S, G) = ( 1 / (2 n_H^[l] n_W^[l] n_C^[l])² )  Σ_k Σ_{k′} ( G_{kk′}^[l](S) − G_{kk′}^[l](G) )²

[Gatys et al., 2015. A neural algorithm of artistic style] Andrew Ng
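A NumPy sketch of the style ("Gram") matrix and the per-layer style cost above (array shapes and function names are assumptions):

```python
import numpy as np

def gram_matrix(a):
    """Style matrix of an activation volume a with shape (n_H, n_W, n_C)."""
    n_H, n_W, n_C = a.shape
    a_unrolled = a.reshape(n_H * n_W, n_C)   # each row: all channel activations at one position
    return a_unrolled.T @ a_unrolled         # (n_C, n_C) channel-correlation matrix

def layer_style_cost(a_S, a_G):
    """Squared difference of the style matrices of the style and generated images."""
    n_H, n_W, n_C = a_S.shape
    G_S, G_G = gram_matrix(a_S), gram_matrix(a_G)
    return np.sum((G_S - G_G) ** 2) / (2 * n_H * n_W * n_C) ** 2
```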


Convolutional
Networks in 1D or 3D

1D and 3D
deeplearning.ai
generalizations of
models
Convolutions in 2D and 1D


2D filter
5×5
2D input image
14×14

1 20 15 3 18 12 4 17 1 3 10 3 1

Andrew Ng
3D data

Andrew Ng
3D data

Andrew Ng
3D data

Andrew Ng
3D data

Andrew Ng
3D data

Andrew Ng
3D data

Andrew Ng
3D data

Andrew Ng
3D data

Andrew Ng
3D data

Andrew Ng
3D data

Andrew Ng
3D data

Andrew Ng
3D convolution


3D filter

3D volume

Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Recurrent Neural
Networks

Why sequence
deeplearning.ai
models?
Examples of sequence data
“The quick brown fox jumped
Speech recognition over the lazy dog.”

Music generation ∅
“There is nothing to like
Sentiment classification in this movie.”

DNA sequence analysis AGCCCCTGTGAGGAACTAG AGCCCCTGTGAGGAACTAG

Machine translation Voulez-vous chanter avec Do you want to sing with


moi? me?

Video activity recognition Running

Name entity recognition Yesterday, Harry Potter Yesterday, Harry Potter


met Hermione Granger. met Hermione Granger.
Andrew Ng
Recurrent Neural
Networks

Notation
deeplearning.ai
Motivating example
x: Harry Potter and Hermione Granger invented a new spell.

Andrew Ng
Representing words
x: Harry Potter and Hermione Granger invented a new spell.
! "#$ ! "%$ ! "&$ ⋯ ! "($

Andrew Ng
Representing words
x: Harry Potter and Hermione Granger invented a new spell.
! "#$ ! "%$ ! "&$ ⋯ ! "($

And = 367
Invented = 4700
A=1
New = 5976
Spell = 8376
Harry = 4075
Potter = 6830
Hermione = 4200
Gran… = 4000

Andrew Ng
Recurrent Neural
Networks

Recurrent Neural
deeplearning.ai
Network Model
Why not a standard network?
! "#$ ) "#$

! "%$ ) "%$

⋮ ⋮ ⋮ ⋮

! "'($ ) "'*$

Problems:
- Inputs, outputs can be different lengths in different examples.
- Doesn’t share features learned across different positions of text.
Andrew Ng
Recurrent Neural Networks

He said, “Teddy Roosevelt was a great President.”


He said, “Teddy bears are on sale!”
Andrew Ng
Forward Propagation
)- "#$ )- "%$ )- ".$ )- "'* $

+"#$ +"%$ +"'( /#$


+",$ ⋯

! "#$ ! "%$ ! ".$ ! "'( $

Andrew Ng
Simplified RNN notation
a^<t> = g(W_aa a^<t−1> + W_ax x^<t> + b_a)

ŷ^<t> = g(W_ya a^<t> + b_y)

Andrew Ng
Recurrent Neural
Networks

Backpropagation
deeplearning.ai
through time
Forward propagation and backpropagation
'( "&$ '( ")$ '( "*$ '( "+. $

!"&$ !")$ !"+, -&$


!"#$ ⋯

% "&$ % ")$ % "*$ % "+, $

Andrew Ng
Forward propagation and backpropagation

ℒ "1$ '( "1$ , ' "1$ =

Backpropagation through time


Andrew Ng
Recurrent Neural
Networks

Different types
deeplearning.ai
of RNNs
Examples of sequence data
“The quick brown fox jumped
Speech recognition over the lazy dog.”

Music generation ∅
“There is nothing to like
Sentiment classification in this movie.”

DNA sequence analysis AGCCCCTGTGAGGAACTAG AGCCCCTGTGAGGAACTAG

Machine translation Voulez-vous chanter avec Do you want to sing with


moi? me?

Video activity recognition Running

Name entity recognition Yesterday, Harry Potter Yesterday, Harry Potter


met Hermione Granger. met Hermione Granger.
Andrew Ng
Examples of RNN architectures

Andrew Ng
Examples of RNN architectures

Andrew Ng
Summary of RNN types
() #'% () #'% () #*% () #+, % ()
"#$% "#$% ⋯ "#$% ⋯
& #'% & & #'% & #*% & #+. %
One to one One to many Many to one

() #'% () #*% () #+, % () #'% () #+, %

"#$% "#$% ⋯ ⋯ ⋯

& #'% & #*% & #'% & #+. %


& #+. %
Many to many Many to many
Andrew Ng
Recurrent Neural
Networks

Language model and


deeplearning.ai
sequence generation
What is language modelling?
Speech recognition
The apple and pair salad.

The apple and pear salad.

!(The apple and pair salad) =

!(The apple and pear salad) =

Andrew Ng
Language modelling with an RNN
Training set: large corpus of english text.

Cats average 15 hours of sleep a day.

The Egyptian Mau is a bread of cat. <EOS>


Andrew Ng
RNN model

Cats average 15 hours of sleep a day. <EOS>

ℒ &' ()*, & ()* = − - &0()* log &'0()*


0
ℒ = - ℒ ()* &' ()*, & ()*
) Andrew Ng
Recurrent Neural
Networks

Sampling novel
deeplearning.ai
sequences
Sampling a sequence from a trained RNN
'( "&$ '( "/$ '( "0$ '( ")* $

!"#$ !"&$ !"/$ !"0$ ⋯ !")* $

% "&$ ' "&$ ' "/$ ' ")- .&$

Andrew Ng
Character-level language model
Vocabulary = [a, aaron, …, zulu, <UNK>]

'( "&$ '( "/$ '( "0$ '( ")* $

!"#$ !"&$ !"/$ !"0$ ⋯ !")* $

% "&$ '( "&$ '( "/$ '( ")- .&$


Andrew Ng
Sequence generation
News Shakespeare

President enrique peña nieto, announced The mortal moon hath her eclipse in love.
sench’s sulk former coming football langston
paring. And subject of this thou art another this fold.

“I was not at all surprised,” said hich langston. When besser be my love to me see sabl’s.

“Concussion epidemic”, to be examined. For whose are ruse of mine eyes heaves.

The gray football the told some and this has on


the uefa icon, should money as.

Andrew Ng
Recurrent Neural
Networks

Vanishing gradients
deeplearning.ai
with RNNs
Vanishing gradients with RNNs
'( "&$ '( "-$ '( "/$ '( ")* $

!"#$ !"&$ !"-$ !"/$ ⋯ !")* $

% "&$ % "-$ % "/$ % "). $

% ⋮ ⋮ ⋮ ⋮ ⋯ ⋮ ⋮ ⋮ '(

Exploding gradients.
Andrew Ng
Recurrent Neural
Networks

Gated Recurrent
deeplearning.ai
Unit (GRU)
RNN unit

a^<t> = g(W_a [a^<t−1>, x^<t>] + b_a)

Andrew Ng
GRU (simplified)

The cat, which already ate …, was full.


[Cho et al., 2014. On the properties of neural machine translation: Encoder-decoder approaches]
[Chung et al., 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling] Andrew Ng
Full GRU
c̃^<t> = tanh(W_c [Γ_r ∗ c^<t−1>, x^<t>] + b_c)

Γ_u = σ(W_u [c^<t−1>, x^<t>] + b_u)
Γ_r = σ(W_r [c^<t−1>, x^<t>] + b_r)

c^<t> = Γ_u ∗ c̃^<t> + (1 − Γ_u) ∗ c^<t−1>
a^<t> = c^<t>

The cat, which ate already, was full.

Andrew Ng
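A single GRU time step, sketched in NumPy from the equations above (parameter names and shapes are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, params):
    """One GRU time step.
    c_prev: (n_c,) memory cell, x_t: (n_x,) input,
    params: Wc, Wu, Wr of shape (n_c, n_c + n_x) and bc, bu, br of shape (n_c,)."""
    concat = np.concatenate([c_prev, x_t])
    gamma_u = sigmoid(params["Wu"] @ concat + params["bu"])            # update gate
    gamma_r = sigmoid(params["Wr"] @ concat + params["br"])            # relevance gate
    c_tilde = np.tanh(params["Wc"] @ np.concatenate([gamma_r * c_prev, x_t]) + params["bc"])
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev                 # keep or update the memory
    return c_t                                                         # a<t> = c<t> for the GRU
```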
Recurrent Neural
Networks

LSTM (long short


deeplearning.ai
term memory) unit
GRU and LSTM
GRU
c̃^<t> = tanh(W_c [Γ_r ∗ c^<t−1>, x^<t>] + b_c)
Γ_u = σ(W_u [c^<t−1>, x^<t>] + b_u)
Γ_r = σ(W_r [c^<t−1>, x^<t>] + b_r)
c^<t> = Γ_u ∗ c̃^<t> + (1 − Γ_u) ∗ c^<t−1>
a^<t> = c^<t>

[Hochreiter & Schmidhuber 1997. Long short-term memory] Andrew Ng


LSTM units
GRU:
c̃^<t> = tanh(W_c [Γ_r ∗ c^<t−1>, x^<t>] + b_c)
Γ_u = σ(W_u [c^<t−1>, x^<t>] + b_u)
Γ_r = σ(W_r [c^<t−1>, x^<t>] + b_r)
c^<t> = Γ_u ∗ c̃^<t> + (1 − Γ_u) ∗ c^<t−1>
a^<t> = c^<t>

LSTM:
c̃^<t> = tanh(W_c [a^<t−1>, x^<t>] + b_c)
Γ_u = σ(W_u [a^<t−1>, x^<t>] + b_u)      (update gate)
Γ_f = σ(W_f [a^<t−1>, x^<t>] + b_f)      (forget gate)
Γ_o = σ(W_o [a^<t−1>, x^<t>] + b_o)      (output gate)
c^<t> = Γ_u ∗ c̃^<t> + Γ_f ∗ c^<t−1>
a^<t> = Γ_o ∗ c^<t>
[Hochreiter & Schmidhuber 1997. Long short-term memory] Andrew Ng
LSTM in pictures
D #$%

!̃ #$% = tanh(,- =#$12%, 4 #$% + 6- )


softmax

=#$%
Γ8 = 9(,8 =#$12%, 4 #$% + 68 )
! #$12% * ⨁ ! #$%
--
Γ> = 9(,> =#$12%, 4 #$% + 6> ) tanh ! #$%
* =#$%
Γ? = 9(,? =#$12%, 4 #$% + 6? ) =#$12% B #$%
C #$%
!̃ #$% A #$%
*
=#$%
! #$% = Γ8 ∗ !̃ #$% + Γ> ∗ ! #$12%
forget gate update gate tanh output gate

=#$% = Γ? ∗ ! #$%
4 #$%
D #2% D #F% D #G%
softmax softmax softmax

=#2% =#F% =#G%


! #F%
-- --
! #G%
--
! #2%
! #E% * ⨁ ! #2% * ⨁ ! #F% * ⨁

=#E% #2% =#2% =#F%


=#F% =#G%
=

4 #2% 4 #F% 4 #G% Andrew Ng


Recurrent Neural
Networks

Bidirectional RNN
deeplearning.ai
Getting information from the future
He said, “Teddy bears are on sale!”
He said, “Teddy Roosevelt was a great President!”

!" #)% !" #(% !" #*% !" #.% !" #-% !" #/% !" #$%

+#,% +#)% +#(% +#*% +#.% +#-% +#/% +#$%

' #)% ' #(% ' #*% ' #.% ' #-% ' #/% ' #$%
He said, “Teddy bears are on sale!”

Andrew Ng
Bidirectional RNN (BRNN)

Andrew Ng
Recurrent Neural
Networks

Deep RNNs
deeplearning.ai
Deep RNN example
, "#$ , "%$ , "&$ , "'$

([&]"+$ ([&]"#$ ([&]"%$ ([&]"&$ ([&]"'$

([%]"+$ ([%]"#$ ([%]"%$ ([%]"&$ ([%]"'$

([#]"+$

! "#$ ! "%$ ! "&$ ! "'$

Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


NLP and Word
Embeddings

Word representation
deeplearning.ai
Word representation
V = [a, aaron, …, zulu, <UNK>]

1-hot representation

Man Woman King Queen Apple Orange


I want a glass of orange ______.
(5391) (9853) (4914) (7157) (456) (6257)
0 0 0 0 0 0 I want a glass of apple______.
0 0 0 0 ⋮ 0
0 0 0 0 1 0
0 0 ⋮ 0 ⋮ 0
⋮ 0 1 0 0 0
1 ⋮ ⋮ ⋮ 0 ⋮
⋮ 1 0 1 0 1
0 ⋮ 0 ⋮ 0 ⋮
0 0 0 0 0 0

Andrew Ng
Featurized representation: word embedding
Man Woman King Queen Apple Orange
(5391) (9853) (4914) (7157) (456) (6257)

-0.95 0.97 0.00 0.01

0.93 0.95 -0.01 0.00

0.7 0.69 0.03 -0.02

0.02 0.01 0.95 0.97

I want a glass of orange ______.


I want a glass of apple______.
Andrew Ng
Visualizing word embeddings

man
woman
dog
king
cat
queen fish

apple
grape
three four
one orange
two

[van der Maaten and Hinton., 2008. Visualizing data using t-SNE] Andrew Ng
NLP and Word
Embeddings

Using word
deeplearning.ai
embeddings
Named entity recognition example

1 1 0 0 0 0

Sally Johnson is an orange farmer

Robert Lin is an apple farmer

Andrew Ng
Transfer learning and word embeddings

1. Learn word embeddings from large text corpus. (1-100B words)

(Or download pre-trained embedding online.)

2. Transfer embedding to new task with smaller training set.


(say, 100k words)

3. Optional: Continue to finetune the word embeddings with new


data.

Andrew Ng
Relation to face encoding

$ (&) f($ (&) )


⋮ )*

$ (() f($ (() )

[Taigman et. al., 2014. DeepFace: Closing the gap to human level performance] Andrew Ng
NLP and Word
Embeddings

Properties of word
deeplearning.ai
embeddings
Analogies
Man Woman King Queen Apple Orange
(5391) (9853) (4914) (7157) (456) (6257)
Gender −1 1 -0.95 0.97 0.00 0.01
Royal 0.01 0.02 0.93 0.95 -0.01 0.00
Age 0.03 0.02 0.70 0.69 0.03 -0.02
Food 0.09 0.01 0.02 0.01 0.95 0.97

[Mikolov et. al., 2013, Linguistic regularities in continuous space word representations] Andrew Ng
Analogies using word vectors man
woman dog
king
cat
queen fish

three four apple


grape
one
two orange

e_man − e_woman ≈ e_king − e_?

Andrew Ng
Cosine similarity
sim(e_w, e_king − e_man + e_woman)

Man:Woman as Boy:Girl
Ottawa:Canada as Nairobi:Kenya
Big:Bigger as Tall:Taller
Yen:Japan as Ruble:Russia

Andrew Ng
NLP and Word
Embeddings

Embedding matrix
deeplearning.ai
Embedding matrix

In practice, use specialized function to look up an embedding.


Andrew Ng
NLP and Word
Embeddings

Learning word
deeplearning.ai
embeddings
Neural language model
I want a glass of orange ______.
4343 9665 1 3852 6163 6257

I *+,+, 4 5+,+,

want *-../ 4 5-../

a *0 4 50

glass *,1/2 4 5,1/2

of *.0., 4 5.0.,

orange *.2/3 4 5.2/3


[Bengio et. al., 2003, A neural probabilistic language model] Andrew Ng
Other context/target pairs
I want a glass of orange juice to go along with my cereal.

Context: Last 4 words.

4 words on left & right

Last 1 word

Nearby 1 word

Andrew Ng
NLP and Word
Embeddings

Word2Vec
deeplearning.ai
Skip-grams
I want a glass of orange juice to go along with my cereal.

[Mikolov et. al., 2013. Efficient estimation of word representations in vector space.] Andrew Ng
Model
Vocab size = 10,000k

Andrew Ng
Problems with softmax classification

p(t | c) = exp(θ_t^T e_c) / Σ_{j=1}^{10,000} exp(θ_j^T e_c)

How to sample the context c?

Andrew Ng
NLP and Word
Embeddings

Negative sampling
deeplearning.ai
Defining a new learning problem
I want a glass of orange juice to go along with my cereal.

[Mikolov et. al., 2013. Distributed representation of words and phrases and their compositionality] Andrew Ng
Model
&'( )*
%
Softmax: ! "# =
-.,... &,( )*
∑01- % context word target?
orange juice 1
orange king 0
orange book 0
orange the 0
orange of 0

Andrew Ng
Selecting negative examples
context word target?
orange juice 1
orange king 0
orange book 0
orange the 0
orange of 0

Andrew Ng
NLP and Word
Embeddings

GloVe word vectors


deeplearning.ai
GloVe (global vectors for word representation)

I want a glass of orange juice to go along with my cereal.

[Pennington et. al., 2014. GloVe: Global vectors for word representation] Andrew Ng
Model

Andrew Ng
A note on the featurization view of word
embeddings
Man Woman King Queen
(5391) (9853) (4914) (7157)
Gender −1 1 -0.95 0.97
Royal 0.01 0.02 0.93 0.95
Age 0.03 0.02 0.70 0.69
Food 0.09 0.01 0.02 0.01

minimize  Σ_{i=1}^{10,000} Σ_{j=1}^{10,000} f(X_ij) ( θ_i^T e_j + b_i + b_j′ − log X_ij )²

Andrew Ng
NLP and Word
Embeddings

Sentiment
deeplearning.ai
classification
Sentiment classification problem
! "
The dessert is excellent.

Service was quite slow.

Good for a quick meal, but nothing


special.

Completely lacking in good taste,


good service, and good ambience.

Andrew Ng
Simple sentiment classification model
The dessert is excellent
8928 2468 4694 3180

The #$%&$ , -$%&$

desert #&'($ , -&'($

is #'(%' , -'(%'

excellent #)*$+ , -)*$+

“Completely lacking in good


taste, good service, and good
ambience.” Andrew Ng
RNN for sentiment classification "8

softmax

: ;+<
: ;*< :;&< :;)< :;'< ⋯ :;*+<

-*$6& -'%(( -''&7 -)$$& -))+

, , , , ,

Completely lacking in good …. ambience

Andrew Ng
NLP and Word
Embeddings

Debiasing word
deeplearning.ai
embeddings
The problem of bias in word embeddings
Man:Woman as King:Queen

Man:Computer_Programmer as Woman: Homemaker

Father:Doctor as Mother: Nurse

Word embeddings can reflect gender, ethnicity, age, sexual


orientation, and other biases of the text used to train the
model.

[Bolukbasi et. al., 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings] Andrew Ng
Addressing bias in word embeddings
1. Identify bias direction.

2. Neutralize: For every word that


is not definitional, project to get rid
of bias.

3. Equalize pairs.

[Bolukbasi et. al., 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings] Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Sequence to
sequence models

Basic models
deeplearning.ai
Sequence to sequence model
% "&$ % "*$ % "+$ % ",$ % "-$
Jane visite l’Afrique en septembre
Jane is visiting Africa in September.
. "&$ . "*$ . "+$ . ",$ . "-$ . "/$

!"#$ ⋯

% "&$ % "'( $

[Sutskever et al., 2014. Sequence to sequence learning with neural networks]


[Cho et al., 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation] Andrew Ng
Image captioning . "&$ . "*$ . "+$ . ",$ . "-$ . "/$
A cat sitting on a chair

MAX-POOL MAX-POOL

11 × 11 3× 3 5× 5 3× 3 3×3 3×3
s=4 s=2 same s=2 same
55×55 ×96 27×27 ×96 27×27 ×256 13×13 ×256 13×13 ×384

MAX-POOL
= ⋮ ⋮ ⋮
3×3 3×3
s=2 Softmax
1000
13×13 ×384 13×13 ×256 6×6 ×256 9216 4096 4096

.9 "&$ .9 "*$ .9 "':$


%
[Mao et. al., 2014. Deep captioning with multimodal recurrent neural networks]
[Vinyals et. al., 2014. Show and tell: Neural image caption generator]
[Karpathy and Li, 2015. Deep visual-semantic alignments for generating image descriptions] Andrew Ng
Sequence to
sequence models

Picking the most


deeplearning.ai
likely sentence
Machine translation as building a conditional
language model
'( "&$ '( ".$ '( "*, $

Language model: !"#$ ⋯

% "&$ % ".$

'( "&$ '( "*, $

Machine translation: !"#$ ⋯ ⋯

% "&$ % "*+ $

Andrew Ng
Finding the most likely translation
Jane visite l’Afrique en septembre. /(' "&$ , … , ' "*, $ | %)

Jane is visiting Africa in September.


Jane is going to be visiting Africa in September.
In September, Jane will visit Africa.
Her African friend welcomed Jane in September.

arg max /(' "&$, … , ' "*, $| %)


:;<= ,…,:;>, =

Andrew Ng
Why not a greedy search?
'( "&$ '( "*, $

!"#$ ⋯ ⋯

% "&$ % "*+ $

Jane is visiting Africa in September.


Jane is going to be visiting Africa in September.

Andrew Ng
Sequence to
sequence models

Beam search
deeplearning.ai
Beam search algorithm
Step 1
a 0(' "&$ | %)

'( "&$
in
⋮ !"#$ ⋯
10000 jane
⋮ % "&$ % "*+ $
september

zulu

Andrew Ng
Beam search algorithm
Step 1 Step 2
!"#$ ⋯
a
⋮ % "&$ % "*+ $
in

10000 jane
⋮ !"#$ ⋯
september
% "&$ % "*+ $

zulu
!"#$ ⋯

% "&$ % "*+ $ Andrew Ng


Beam search (4 = 3)
in september '( "7$
in september !"#$ ⋯
% "&$ % "*+ $
jane is '( "7$

jane is !"#$ ⋯
% "&$ % "*+ $
jane visits '( "7$

jane visits !"#$ ⋯


% "&$ % "*+ $

0(' "&$ , ' "9$ | %) jane visits africa in september. <EOS>


Andrew Ng
Sequence to
sequence models

Refinements to
deeplearning.ai
beam search
Length normalization
arg max_y  Π_{t=1}^{T_y} P(y^<t> | x, y^<1>, … , y^<t−1>)

arg max_y  Σ_{t=1}^{T_y} log P(y^<t> | x, y^<1>, … , y^<t−1>)

Length-normalized objective (with 0 ≤ α ≤ 1, e.g. α = 0.7):

(1 / T_y^α)  Σ_{t=1}^{T_y} log P(y^<t> | x, y^<1>, … , y^<t−1>)
Andrew Ng
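A small sketch of scoring one beam-search candidate with the length-normalized log-likelihood above (α = 0.7 is a common heuristic choice, an assumption rather than a fixed rule):

```python
import numpy as np

def normalized_score(token_log_probs, alpha=0.7):
    """token_log_probs: list of log P(y<t> | x, y<1..t-1>) for one candidate sentence."""
    T_y = len(token_log_probs)
    return sum(token_log_probs) / (T_y ** alpha)

short = [np.log(0.5)] * 5       # 5 tokens, each fairly likely
long = [np.log(0.5)] * 10       # 10 tokens with the same per-token probability
print(normalized_score(short), normalized_score(long))
# without normalization the sums are -3.47 vs -6.93; normalization shrinks the
# gap, reducing the unnatural preference for very short translations
```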
Beam search discussion

Beam width B?

Unlike exact search algorithms like BFS (Breadth First Search) or


DFS (Depth First Search), Beam Search runs faster but is not
guaranteed to find exact maximum for arg max )(*|.).
'

Andrew Ng
Sequence to
sequence models

Error analysis on
deeplearning.ai
beam search
Example
Jane visite l’Afrique en septembre.

Human: Jane visits Africa in September.

Algorithm: Jane visited Africa last September.

!"#$ ⋯
% "&$ % "'( $

Andrew Ng
Error analysis on beam search
Human: Jane visits Africa in September. (+ ∗ )
Algorithm: Jane visited Africa last September. (+.)

Case 1:
Beam search chose +.. But + ∗ attains higher / + % .
Conclusion: Beam search is at fault.
Case 2:
+ ∗ is a better translation than +.. But RNN predicted / + ∗ % < / +. % .
Conclusion: RNN model is at fault.
Andrew Ng
Error analysis process
Human Algorithm / +∗ % / +. % At fault?

Jane visits Africa in Jane visited Africa


September. last September.

Figures out what faction of errors are “due to” beam


search vs. RNN model Andrew Ng
Sequence to
sequence models

Bleu score
deeplearning.ai
(optional)
Evaluating machine translation
French: Le chat est sur le tapis.

Reference 1: The cat is on the mat.

Reference 2: There is a cat on the mat.

MT output: the the the the the the the.

Precision: Modified precision:

[Papineni et. al., 2002. Bleu: A method for automatic evaluation of machine translation] Andrew Ng
Bleu score on bigrams
Example: Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
MT output: The cat the cat on the mat.

the cat
cat the
cat on
on the
the mat
[Papineni et. al., 2002. Bleu: A method for automatic evaluation of machine translation] Andrew Ng
Bleu score on unigrams
Example: Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
MT output: The cat the cat on the mat.

p_1 = Σ_{unigrams ∈ ŷ} Count_clip(unigram) / Σ_{unigrams ∈ ŷ} Count(unigram)

p_n = Σ_{n-grams ∈ ŷ} Count_clip(n-gram) / Σ_{n-grams ∈ ŷ} Count(n-gram)

[Papineni et. al., 2002. Bleu: A method for automatic evaluation of machine translation] Andrew Ng
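A sketch of the clipped (modified) n-gram precision p_n above, reproducing the unigram and bigram examples from these slides (function names are assumptions):

```python
from collections import Counter

def modified_precision(candidate, references, n=1):
    """Clipped n-gram precision: candidate counts are capped at the maximum count
    of that n-gram in any single reference."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngrams(candidate, n)
    max_ref_counts = Counter()
    for ref in references:
        for gram, c in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], c)

    clipped = sum(min(c, max_ref_counts[gram]) for gram, c in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_precision("the the the the the the the".split(), refs, n=1))  # 2/7
print(modified_precision("the cat the cat on the mat".split(), refs, n=2))   # 4/6
```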
Bleu details
!' = Bleu score on n-grams only
Combined Bleu score:

1 if MT_output_length > reference_output_length


BP =
exp(1 − MT_output_length/reference_output_length) otherwise

[Papineni et. al., 2002. Bleu: A method for automatic evaluation of machine translation] Andrew Ng
Sequence to
sequence models

Attention model
deeplearning.ai
intuition
The problem of long sequences "* $
'( "&$ '( ,

!"#$ ⋯ ⋯

% "&$ % "*+ $
Jane s'est rendue en Afrique en septembre dernier, a apprécié la culture et a rencontré beaucoup de
gens merveilleux; elle est revenue en parlant comment son voyage était merveilleux, et elle me tente
d'y aller aussi.
Jane went to Africa last September, and enjoyed the culture and met many wonderful people;
she came back raving about how wonderful her trip was, and is tempting me to go too.

Bleu
score

10 20 30 40 50 Sentence length Andrew Ng


Attention model intuition

'( "&$ '( ".$ '( "/$ '( "1$ '( "0$

!"#$

% "&$ % ".$ % "/$ % "1$ % "0$


jane visite l’Afrique en septembre
[Bahdanau et. al., 2014. Neural machine translation by jointly learning to align and translate] Andrew Ng
Sequence to
sequence models

Attention model
deeplearning.ai
Attention model

)"*$

! "#$ ! "%$ ! "&$ ! "($ ! "'$


jane visite l’Afrique en septembre
[Bahdanau et. al., 2014. Neural machine translation by jointly learning to align and translate] Andrew Ng
Computing attention α^<t,t′>

α^<t,t′> = amount of attention y^<t> should pay to a^<t′>

α^<t,t′> = exp(e^<t,t′>) / Σ_{t′=1}^{T_x} exp(e^<t,t′>)

where e^<t,t′> is computed by a small neural network whose inputs are s^<t−1> (the previous decoder state) and a^<t′> (the activation of the bidirectional encoder at time step t′).

[Bahdanau et. al., 2014. Neural machine translation by jointly learning to align and translate]
[Xu et. al., 2015. Show, attend and tell: Neural image caption generation with visual attention] Andrew Ng
Attention examples
July 20th 1969 1969 − 07 − 20

23 April, 1564 1564 − 04 − 23

",,, .$
Visualization of + :

Andrew Ng
Audio data

Speech recognition
deeplearning.ai
Speech recognition problem
! #
audio clip transcript

“the quick brown fox”

Andrew Ng
Attention model for speech recognition
“T” “h”

% &'( % &)( ⋯

+&,( ⋯

! &'( ! &)( ⋯ ! &---( ! &',,,(

Andrew Ng
CTC cost for speech recognition
(Connectionist temporal classification)
“the quick brown fox”
#. &'( #. &)( #. &',,,(

+&,( ⋯

! &'( ! &)( ! &',,,(

Basic rule: collapse repeated characters not separated by “blank”


[Graves et al., 2006. Connectionist Temporal Classification: Labeling unsegmented sequence data with recurrent neural networks] Andrew Ng
Audio data

Trigger word
deeplearning.ai
detection
What is trigger word detection?

Amazon Echo Baidu DuerOS Apple Siri Google Home


(Alexa) (xiaodunihao) (Hey Siri) (Okay Google)

Andrew Ng
Trigger word detection algorithm

!"#$

% "&$ % "'$ % "($

Andrew Ng
Conclusion

Summary and
deeplearning.ai
thank you
Specialization outline

1. Neural Networks and Deep Learning


2. Improving Deep Neural Networks: Hyperparameter
tuning, Regularization and Optimization
3. Structuring Machine Learning Projects
4. Convolutional Neural Networks
5. Sequence Models

Andrew Ng
Deep learning is a super power


Andrew Ng
Thank you.
-Andrew Ng

Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Sequence to
sequence models

Transformers
deeplearning.ai
Intuition
Transformers Motivation
Increased complexity,
sequential

RNN GRU LSTM

𝑦ො <1> 𝑦ො <2> 𝑦ො <𝑇𝑦>


Andrew Ng
Transformers Intuition
• Attention + CNN
• Self-Attention

• Multi-Head Attention

[Vaswani et al. 2017, Attention Is All You Need] Andrew Ng


Sequence to
sequence models

Self-Attention
deeplearning.ai
Self-Attention Intuition
𝐴(𝑞, 𝐾, 𝑉)= attention-based vector representation of a word

RNN attention:
α^<t,t′> = exp(e^<t,t′>) / Σ_{t′=1}^{T_x} exp(e^<t,t′>)

Transformers attention:
A(q, K, V) = Σ_i [ exp(q · k^<i>) / Σ_j exp(q · k^<j>) ] v^<i>

x^<1>   x^<2>   x^<3>   x^<4>   x^<5>


Jane visite l’Afrique en septembre
[Vaswani et al. 2017, Attention Is All You Need] Andrew Ng
Self-Attention
A(q, K, V) = Σ_i [ exp(q · k^<i>) / Σ_j exp(q · k^<j>) ] v^<i>

Attention(Q, K, V) = softmax( Q K^T / √d_k ) V

𝐴<1> 𝐴<2> 𝐴<3> 𝐴<4> 𝐴<5>

+
Query (𝑄) Key (𝐾) Value (𝑉)
𝑣 <1> x 𝑣 <2>
x 𝑣 <3>
x 𝑣 <4>
x 𝑣 <5>
𝑞 <1> 𝑘 <1> 𝑣 <1>
𝑞 <2> 𝑘 <2> 𝑣 <2>
𝑞 <3> ∙ 𝑘 <1> 𝑞 <3> ∙ 𝑘 <2> 𝑞 <3> ∙ 𝑘 <4> 𝑞 <3> ∙ 𝑘 <5> 𝑞 <3> 𝑘 <3> 𝑣 <3>
𝑞 <4> 𝑘 <4> 𝑣 <4>
𝑞 <5> 𝑘 <5> 𝑣 <5>
𝑞 <1> , 𝑘 <1> , 𝑣 <1> 𝑞 <2> , 𝑘 <2> , 𝑣 <2> 𝑞 <3> , 𝑘 <3> , 𝑣 <3> 𝑞 <4> , 𝑘 <4> , 𝑣 <4> 𝑞 <5> , 𝑘 <5> , 𝑣 <5>

𝑥 <1> 𝑥 <2> 𝑥 <3> 𝑥 <4> 𝑥 <5>


Jane visite l’Afrique en septembre
[Vaswani et al. 2017, Attention Is All You Need] Andrew Ng
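A NumPy sketch of the scaled dot-product self-attention formula above (the sequence length and dimensions in the toy example are assumptions):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, computed row by row.
    Q, K: (seq_len, d_k), V: (seq_len, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (seq_len, seq_len) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the keys
    return weights @ V                                  # one attention vector A<i> per word

# Toy usage: 5 words, 4-dimensional q/k/v vectors
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)      # (5, 4)
```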
Sequence to
sequence models

Multi-Head
deeplearning.ai
Attention
Multi-Head Attention
MultiHead(Q, K, V) = concat(head_1, head_2, … , head_n) W_o
head_i = Attention(W_i^Q Q, W_i^K K, W_i^V V)

Attention(Q, K, V) = softmax( Q K^T / √d_k ) V
Multi-Head
Attention
𝑸
𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏(𝑾𝟏𝑸 𝑲
𝒊 𝑸, 𝑾𝟏
𝑽
𝒊 𝑲, 𝑾𝟏
𝒊 𝑽)

Q
W3 , W3K , W3V , Who?

𝑊1𝑖𝑄 𝑞<1> , 𝑊1𝑖𝐾 𝑘 <1> ,𝑊1𝑖𝑉 𝑣 <1> 𝑊1𝑖𝑄 𝑞<2> , 𝑊1𝑖𝐾 𝑘 <2> ,𝑊1𝑖𝑉 𝑣 <2> 𝑄
𝑊1𝑖𝑄 𝑞<3> , 𝑊1𝑖𝐾 𝑘 <3> ,𝑊1𝑖𝑉 𝑣 <3> 𝑊1𝑖𝑄 𝑞<4> , 𝑊1𝑖𝐾 𝑘 <4> ,𝑊1𝑖𝑉 𝑣 <4> 𝑊1𝑖𝑄 𝑞<5> , 𝑊1𝑖𝐾 𝑘 <5> ,𝑊1𝑖𝑉 𝑣 <5>
Q
W2 , W2K , W2V , When?
Q
W1 , W1K , W1V , Did what?
<1> <1> <1> <3> <3> <3>
𝑞 ,𝑘 ,𝑣 𝑞 <2> , 𝑘 <2> , 𝑣 <2> 𝑞 ,𝑘 ,𝑣 𝑞 <4> , 𝑘 <4> , 𝑣 <4> 𝑞 <5> , 𝑘 <5> , 𝑣 <5>

𝑥 <1> 𝑥 <2> 𝑥 <3> 𝑥 <4> 𝑥 <5> W , W , W Q K V

Jane visite l’Afrique en septembre


[Vaswani et al. 2017, Attention Is All You Need] Andrew Ng
Sequence to
sequence models

Transformers
deeplearning.ai
Transformer Details <SOS> Jane visits Africa in September <EOS>
Softmax
Encoder Decoder Linear

Add & Norm

Add & Norm Feed Forward


Neural Network
Feed Forward
Neural Network
Add & Norm

Add & Norm Multi-Head


Attention
Multi-Head
Attention
Add & Norm

Masked
Multi-Head
Multi-Head
Attention
Positional Encoding Attention
Positional encoding, added to the input embeddings:

PE(pos, 2i)   = sin( pos / 10000^(2i/d) )
PE(pos, 2i+1) = cos( pos / 10000^(2i/d) )

Encoder input:  <SOS> x^<1> x^<2> … x^<Tx−1> x^<Tx> <EOS>   (Jane visite l’Afrique en septembre)
Decoder input:  <SOS> y^<1> y^<2> … y^<Ty−1> y^<Ty>

[Vaswani et al. 2017, Attention Is All You Need] Andrew Ng
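A NumPy sketch of the sinusoidal positional encoding above (assuming the 10000 base from Vaswani et al. and an even embedding dimension d):

```python
import numpy as np

def positional_encoding(max_len, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1) positions
    i = np.arange(d // 2)[None, :]                    # (1, d/2) dimension indices
    angles = pos / np.power(10000, (2 * i) / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

print(positional_encoding(6, 8).shape)                # one d-dimensional vector per position
```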
