ML Lec 10 Neural Networks
[Figure: hierarchy of learned features — lines & edges → eyes, nose & ears → facial structure]
Why Now?
Neural networks date back decades, so why the resurgence?

Timeline:
1952 Stochastic Gradient Descent
1958 Perceptron (learnable weights)
1986 Backpropagation (multi-layer perceptron)
1995 Deep Convolutional NN (digit recognition)

1. Big Data: larger datasets; easier collection & storage
2. Hardware: Graphics Processing Units (GPUs); massively parallelizable
3. Software: improved techniques; new models; toolboxes
The Perceptron
The structural building block of deep learning
The Perceptron: Forward Propagation
Inputs x_1, …, x_m are multiplied by weights w_1, …, w_m and summed (a linear combination of inputs); a non-linear activation function g produces the output:

ŷ = g( Σ_{i=1}^{m} x_i w_i )
Adding a bias term w_0 (the weight on a constant input of 1):

ŷ = g( w_0 + Σ_{i=1}^{m} x_i w_i )

In vector form:

ŷ = g( w_0 + Xᵀ W ),   where X = [x_1 ⋯ x_m]ᵀ and W = [w_1 ⋯ w_m]ᵀ
Activation Functions

ŷ = g( w_0 + Xᵀ W )

One common choice is the sigmoid:

g(z) = σ(z) = 1 / (1 + e^(−z))

The Perceptron: Example

We have: w_0 = 1 and W = [3, −2]ᵀ, so:

ŷ = g( w_0 + Xᵀ W ) = g( 1 + 3 x_1 − 2 x_2 )

This is just a line in 2D!
The Perceptron: Example

[Plot: the decision boundary 1 + 3 x_1 − 2 x_2 = 0 in the (x_1, x_2) plane]
The Perceptron: Example

Assume we have input: X = [−1, 2]ᵀ

ŷ = g( 1 + (3 · −1) − (2 · 2) )
  = g( −6 ) ≈ 0.002
The Perceptron: Example

The sign of z = 1 + 3 x_1 − 2 x_2 decides which side of the line an input falls on:

z < 0  ⇒  ŷ = g(z) < 0.5
z > 0  ⇒  ŷ = g(z) > 0.5
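As a sanity check, the example above can be computed directly. This is a minimal pure-Python sketch; the weights w_0 = 1, W = [3, −2]ᵀ come from the slides:

```python
import math

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x1, x2, w0=1.0, w1=3.0, w2=-2.0):
    # y_hat = g(w0 + w1*x1 + w2*x2)
    return sigmoid(w0 + w1 * x1 + w2 * x2)

y_hat = perceptron(-1, 2)   # z = 1 + 3*(-1) - 2*2 = -6
print(round(y_hat, 3))      # -> 0.002
```

Any input on the other side of the line (e.g. x = [2, 2]ᵀ, where z = 3 > 0) gives ŷ > 0.5.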
Building Neural Networks with Perceptrons
The Perceptron: Simplified
z = w_0 + Σ_{j=1}^{m} x_j w_j
ŷ = g(z)
Multi Output Perceptron
z_i = w_{0,i} + Σ_{j=1}^{m} x_j w_{j,i},   ŷ_i = g(z_i)
Single Layer Neural Network
Hidden-layer weights W^(1) connect the inputs x_1, …, x_m to hidden units z_1, …, z_{d1}; output weights W^(2) connect the hidden units to the outputs ŷ_1, ŷ_2.

z_i^(1) = w_{0,i}^(1) + Σ_{j=1}^{m} x_j w_{j,i}^(1)

For example, for hidden unit z_2:

z_2^(1) = w_{0,2}^(1) + x_1 w_{1,2}^(1) + x_2 w_{2,2}^(1) + x_3 w_{3,2}^(1)
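The layer equation above can be sketched as a plain function. This is illustrative only: the weight values below are made up, and real code would use a library such as Keras:

```python
import math

def dense(x, W, b, g):
    # One dense layer: z_i = b_i + sum_j x_j * W[j][i], output g(z_i)
    out = []
    for i in range(len(b)):
        z = b[i] + sum(x[j] * W[j][i] for j in range(len(x)))
        out.append(g(z))
    return out

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
identity = lambda z: z

# Hypothetical weights: 3 inputs -> 2 hidden units -> 2 outputs
W1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]   # W^(1), shape (3, 2)
b1 = [0.0, 0.1]
W2 = [[1.0, -1.0], [0.5, 0.5]]                # W^(2), shape (2, 2)
b2 = [0.0, 0.0]

x = [1.0, 2.0, 3.0]
hidden = dense(x, W1, b1, sigmoid)    # z^(1) passed through g
y_hat = dense(hidden, W2, b2, identity)
```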
Single Layer Neural Network, in code:

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(m,))
hidden = Dense(d1)(inputs)
outputs = Dense(2)(hidden)
model = Model(inputs, outputs)
Deep Neural Network

Stacking layers: the hidden units z_{k,1}, …, z_{k,n_k} of layer k feed the next layer, ending in the outputs ŷ_1, ŷ_2.
! " = Hours
spent on the
final project
Legend
Pass
Fail
! " = Hours
spent on the
final project
Legend
? Pass
4
5 Fail
$#
!#
#
! = 4 ,5 $" '&# Predicted: 0.1
!"
$%
Example Problem: Will I pass this class?
$#
!#
# Predicted: 0.1
! = 4 ,5 $" '&#
Actual: 1
!"
$%
Quantifying Loss
The loss of our network measures the cost incurred from incorrect predictions:

ℒ( f(x^(i); W), y^(i) )     e.g. Predicted: 0.1, Actual: 1
The empirical loss measures the total loss over our entire dataset:

x          f(x)    y
[4, 5]ᵀ    0.1     1
[2, 1]ᵀ    0.8     0
[5, 8]ᵀ    0.6     1
⋮          ⋮       ⋮

J(W) = (1/n) Σ_{i=1}^{n} ℒ( f(x^(i); W), y^(i) )
                  (Predicted)    (Actual)

Also known as:
• Objective function
• Cost function
• Empirical Risk
Binary Cross Entropy Loss
Cross entropy loss can be used with models that output a probability between 0 and 1:

J(W) = −(1/n) Σ_{i=1}^{n} [ y^(i) log f(x^(i); W) + (1 − y^(i)) log( 1 − f(x^(i); W) ) ]
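A minimal sketch of this loss, using the predicted/actual values from the table above:

```python
import math

def binary_cross_entropy(y, p):
    # J(W) = -(1/n) * sum[ y*log(p) + (1-y)*log(1-p) ]
    n = len(y)
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p)) / n

y_true = [1, 0, 1]        # actual labels from the table
y_pred = [0.1, 0.8, 0.6]  # predicted probabilities from the table
loss = binary_cross_entropy(y_true, y_pred)
```

Note that confident wrong predictions (like 0.1 for a true label of 1) dominate the loss.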
Mean Squared Error Loss
Mean squared error loss can be used with regression models that output continuous real numbers (e.g. final grades as a percentage):

J(W) = (1/n) Σ_{i=1}^{n} ( y^(i) − f(x^(i); W) )²
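The corresponding sketch for MSE, using the predicted/actual grades from the slide (30/90, 80/20, 85/95):

```python
def mse(y, f):
    # J(W) = (1/n) * sum( (y_i - f_i)^2 )
    return sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / len(y)

loss = mse([90, 20, 95], [30, 80, 85])  # actual vs. predicted percentages
```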
Loss Optimization

We want to find the network weights that achieve the lowest loss:

W* = argmin_W (1/n) Σ_{i=1}^{n} ℒ( f(x^(i); W), y^(i) )
W* = argmin_W J(W)

Remember: W = { W^(0), W^(1), ⋯ }
Remember: our loss is a function of the network weights!

[Plot: loss surface J(w_0, w_1) over the weights w_0, w_1]
Loss Optimization

1. Randomly pick an initial (w_0, w_1)
2. Compute the gradient, ∂J(W)/∂W
3. Take a small step in the opposite direction of the gradient
4. Repeat until convergence — this is Gradient Descent

[Plots: loss surface J(w_0, w_1) with the descent steps marked]
Gradient Descent

Algorithm
1. Initialize weights randomly ~N(0, σ²)        weights = tf.random_normal(shape, stddev=sigma)
2. Loop until convergence:
3.   Compute gradient, ∂J(W)/∂W
4.   Update weights, W ← W − η ∂J(W)/∂W         weights_new = weights.assign(weights - lr * grads)
5. Return weights
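The algorithm above, sketched on a one-dimensional toy loss J(w) = (w − 3)², chosen so the gradient 2(w − 3) has a closed form (all values here are made up for illustration):

```python
import random

def gradient_descent(lr=0.1, steps=100, sigma=1.0):
    w = random.gauss(0.0, sigma)      # 1. initialize randomly ~ N(0, sigma^2)
    for _ in range(steps):            # 2. loop (fixed steps stand in for convergence)
        grad = 2.0 * (w - 3.0)        # 3. compute gradient dJ(w)/dw
        w = w - lr * grad             # 4. update w <- w - lr * grad
    return w                          # 5. return weights

w_star = gradient_descent()           # converges to the minimum at w = 3
```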
Computing Gradients: Backpropagation

x → z_1 → ŷ → J(W), with weight w_1 into z_1 and weight w_2 into ŷ.

How does a small change in one weight (e.g. w_2) affect the final loss J(W)? Apply the chain rule:

∂J(W)/∂w_2 = ∂J(W)/∂ŷ · ∂ŷ/∂w_2

∂J(W)/∂w_1 = ∂J(W)/∂ŷ · ∂ŷ/∂z_1 · ∂z_1/∂w_1

Repeat this for every weight in the network using gradients from later layers
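A worked numeric instance of the chain rule, on a hypothetical two-weight linear network (values made up):

```python
# z1 = w1 * x,  y_hat = w2 * z1,  J = (y_hat - y)^2
x, y = 2.0, 1.0
w1, w2 = 0.5, -1.0

# Forward pass
z1 = w1 * x            # 1.0
y_hat = w2 * z1        # -1.0
J = (y_hat - y) ** 2   # 4.0

# Backward pass: dJ/dw2 = dJ/dy_hat * dy_hat/dw2
dJ_dyhat = 2.0 * (y_hat - y)
dJ_dw2 = dJ_dyhat * z1
# dJ/dw1 = dJ/dy_hat * dy_hat/dz1 * dz1/dw1 -- the gradient flows back through z1
dJ_dw1 = dJ_dyhat * w2 * x
```

Checking by hand: dJ/dŷ = −4, so dJ/dw_2 = −4 · 1 = −4 and dJ/dw_1 = −4 · (−1) · 2 = 8.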
Neural Networks in Practice:
Optimization
Training Neural Networks is Difficult
Remember: optimization through gradient descent

W ← W − η ∂J(W)/∂W
Loss Functions Can Be Difficult to Optimize

Setting the Learning Rate

A small learning rate converges slowly and gets stuck in false local minima.

[Plots: loss J(W) vs. W from an initial guess, for different learning rates]
How to deal with this?

Idea 1: Try lots of different learning rates and see what works “just right”

Idea 2: Do something smarter! Design an adaptive learning rate that “adapts” to the landscape
Adaptive Learning Rates

• Adam      tf.train.AdamOptimizer      (Kingma et al., “Adam: A Method for Stochastic Optimization,” 2014)
• RMSProp   tf.train.RMSPropOptimizer
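To illustrate the idea (not the exact Adam/RMSProp implementations), here is a minimal RMSProp-style update with assumed hyperparameters: the step is normalized by a running average of squared gradients, so the effective learning rate adapts to the landscape:

```python
import math

def rmsprop_step(w, grad, s, lr=0.01, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * grad ** 2      # running average of grad^2
    w = w - lr * grad / (math.sqrt(s) + eps)   # normalized update
    return w, s

# Minimize the toy loss J(w) = (w - 3)^2, with dJ/dw = 2*(w - 3)
w, s = 0.0, 0.0
for _ in range(2000):
    w, s = rmsprop_step(w, 2.0 * (w - 3.0), s)
```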
Gradient Descent

Algorithm
1. Initialize weights randomly ~N(0, σ²)
2. Loop until convergence:
3.   Compute gradient, ∂J(W)/∂W
4.   Update weights, W ← W − η ∂J(W)/∂W
5. Return weights
Can be very computationally expensive to compute!
Stochastic Gradient Descent

Algorithm
1. Initialize weights randomly ~N(0, σ²)
2. Loop until convergence:
3.   Pick single data point i
4.   Compute gradient, ∂J_i(W)/∂W
5.   Update weights, W ← W − η ∂J_i(W)/∂W
6. Return weights
Easy to compute but very noisy (stochastic)!
Stochastic Gradient Descent

Algorithm
1. Initialize weights randomly ~N(0, σ²)
2. Loop until convergence:
3.   Pick batch of B data points
4.   Compute gradient, ∂J(W)/∂W = (1/B) Σ_{k=1}^{B} ∂J_k(W)/∂W
5.   Update weights, W ← W − η ∂J(W)/∂W
6. Return weights
Fast to compute and a much better estimate of the true gradient!
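A mini-batch SGD sketch on a toy 1-D regression problem (all data and hyperparameters below are made up): the gradient is estimated from B randomly sampled points instead of the full dataset.

```python
import random

# Toy linear model y_hat = w * x; the data satisfies y = 2x, so w should reach 2
random.seed(0)
data = [(x, 2.0 * x) for x in [random.uniform(-1, 1) for _ in range(100)]]

w, lr, B = 0.0, 0.1, 8
for _ in range(500):
    batch = random.sample(data, B)                       # pick batch of B points
    # dJ/dw averaged over the batch, where J = (1/B) * sum (w*x - y)^2
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / B
    w -= lr * grad                                       # update step
```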
Mini-batches while training
Regularization

What is it? A technique that constrains our optimization problem to discourage complex models.
Regularization 1: Dropout

• During training, randomly set some activations to 0
• Typically ‘drop’ 50% of activations in a layer      tf.keras.layers.Dropout(0.5)
• Forces network to not rely on any one node
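What dropout does to a layer's activations can be sketched in a few lines. This is inverted dropout (the variant Keras uses): survivors are scaled by 1/(1 − p) so the expected activation is unchanged, and the layer is a no-op at inference time.

```python
import random

def dropout(activations, p=0.5, training=True):
    # During training: zero each activation with probability p,
    # scale survivors by 1/(1-p). At test time: pass through unchanged.
    if not training:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if random.random() < keep else 0.0 for a in activations]

h = [0.2, 1.5, -0.7, 0.9]
h_train = dropout(h, p=0.5)          # roughly half the units zeroed
h_test = dropout(h, training=False)  # unchanged at inference
```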
Regularization 2: Early Stopping

• Stop training before we have a chance to overfit

[Plot: loss vs. training iterations; Legend: Testing, Training]
The ideal stopping point lies between the under-fitting regime (both losses still high) and the over-fitting regime (testing loss rises while training loss keeps falling).
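A patience-based early-stopping rule can be sketched as follows (the loss values are made up; in Keras this is handled by tf.keras.callbacks.EarlyStopping):

```python
def early_stop_index(val_losses, patience=2):
    # Stop once validation loss has not improved for `patience` checks;
    # return the index of the best (lowest-loss) checkpoint to keep.
    best, best_i, waited = float("inf"), 0, 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_i, waited = loss, i, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_i

# Validation loss falls, then rises as the model starts to overfit
val = [1.0, 0.7, 0.5, 0.45, 0.47, 0.55, 0.7]
stop = early_stop_index(val)   # -> 3, the minimum before over-fitting
```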
Core Foundation Review
[Diagrams: the perceptron and a deep neural network, revisited]
Questions?