ML Lec 10 Neural Networks
[Figure: hierarchy of learned features — lines & edges → eyes, nose & ears → facial structure]
Why Now?
Neural networks date back decades, so why the resurgence?

Timeline:
1952 Stochastic Gradient Descent
1958 Perceptron (learnable weights)
1986 Backpropagation (multi-layer perceptron)
1995 Deep Convolutional NN (digit recognition)

1. Big Data: larger datasets; easier collection & storage
2. Hardware: Graphics Processing Units (GPUs); massively parallelizable
3. Software: improved techniques; new models; toolboxes
The Perceptron
The structural building block of deep learning
The Perceptron: Forward Propagation
Inputs x_1, …, x_m are multiplied by weights w_1, …, w_m and summed (a linear combination of inputs); a non-linear activation function g produces the output:

ŷ = g( Σ_{i=1}^{m} x_i w_i )
Adding a bias term w_0 (the weight on a constant input of 1):

ŷ = g( w_0 + Σ_{i=1}^{m} x_i w_i )

In vector form:

ŷ = g( w_0 + Xᵀ W ),   where X = [x_1 ⋯ x_m]ᵀ and W = [w_1 ⋯ w_m]ᵀ
Activation Functions

ŷ = g( w_0 + Xᵀ W )

One common choice is the sigmoid:

g(z) = σ(z) = 1 / (1 + e^(−z))

The Perceptron: Example

We have: w_0 = 1 and W = [3, −2]ᵀ, so:

ŷ = g( w_0 + Xᵀ W ) = g( 1 + 3 x_1 − 2 x_2 )

This is just a line in 2D!
The Perceptron: Example

[Plot: the decision boundary 1 + 3 x_1 − 2 x_2 = 0 in the (x_1, x_2) plane]
The Perceptron: Example

Assume we have input: X = [−1, 2]ᵀ

ŷ = g( 1 + (3 · −1) − (2 · 2) )
  = g( −6 ) ≈ 0.002
The Perceptron: Example

The sign of z = 1 + 3 x_1 − 2 x_2 decides which side of the line an input falls on:

z < 0  ⇒  ŷ = g(z) < 0.5
z > 0  ⇒  ŷ = g(z) > 0.5
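As a sanity check, the example above can be computed directly. This is a minimal pure-Python sketch; the weights w_0 = 1, W = [3, −2]ᵀ come from the slides:

```python
import math

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x1, x2, w0=1.0, w1=3.0, w2=-2.0):
    # y_hat = g(w0 + w1*x1 + w2*x2)
    return sigmoid(w0 + w1 * x1 + w2 * x2)

y_hat = perceptron(-1, 2)   # z = 1 + 3*(-1) - 2*2 = -6
print(round(y_hat, 3))      # -> 0.002
```

Any input on the other side of the line (e.g. x = [2, 2]ᵀ, where z = 3 > 0) gives ŷ > 0.5.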
Building Neural Networks with Perceptrons
The Perceptron: Simplified
z = w_0 + Σ_{j=1}^{m} x_j w_j
ŷ = g(z)
Multi Output Perceptron
z_i = w_{0,i} + Σ_{j=1}^{m} x_j w_{j,i},   ŷ_i = g(z_i)
Single Layer Neural Network
Hidden-layer weights W^(1) connect the inputs x_1, …, x_m to hidden units z_1, …, z_{d1}; output weights W^(2) connect the hidden units to the outputs ŷ_1, ŷ_2.

z_i^(1) = w_{0,i}^(1) + Σ_{j=1}^{m} x_j w_{j,i}^(1)

For example, for hidden unit z_2:

z_2^(1) = w_{0,2}^(1) + x_1 w_{1,2}^(1) + x_2 w_{2,2}^(1) + x_3 w_{3,2}^(1)
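The layer equation above can be sketched as a plain function. This is illustrative only: the weight values below are made up, and real code would use a library such as Keras:

```python
import math

def dense(x, W, b, g):
    # One dense layer: z_i = b_i + sum_j x_j * W[j][i], output g(z_i)
    out = []
    for i in range(len(b)):
        z = b[i] + sum(x[j] * W[j][i] for j in range(len(x)))
        out.append(g(z))
    return out

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
identity = lambda z: z

# Hypothetical weights: 3 inputs -> 2 hidden units -> 2 outputs
W1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]   # W^(1), shape (3, 2)
b1 = [0.0, 0.1]
W2 = [[1.0, -1.0], [0.5, 0.5]]                # W^(2), shape (2, 2)
b2 = [0.0, 0.0]

x = [1.0, 2.0, 3.0]
hidden = dense(x, W1, b1, sigmoid)    # z^(1) passed through g
y_hat = dense(hidden, W2, b2, identity)
```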
Single Layer Neural Network, in code:

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(m,))
hidden = Dense(d1)(inputs)
outputs = Dense(2)(hidden)
model = Model(inputs, outputs)
Deep Neural Network

Stacking layers: the hidden units z_{k,1}, …, z_{k,n_k} of layer k feed the next layer, ending in the outputs ŷ_1, ŷ_2.
! " = Hours
spent on the
final project
Legend
Pass
Fail
! " = Hours
spent on the
final project
Legend
? Pass
4
5 Fail
$#
!#
#
! = 4 ,5 $" '&# Predicted: 0.1
!"
$%
Example Problem: Will I pass this class?
$#
!#
# Predicted: 0.1
! = 4 ,5 $" '&#
Actual: 1
!"
$%
Quantifying Loss
The loss of our network measures the cost incurred from incorrect predictions:

ℒ( f(x^(i); W), y^(i) )     e.g. Predicted: 0.1, Actual: 1
The empirical loss measures the total loss over our entire dataset:

x          f(x)    y
[4, 5]ᵀ    0.1     1
[2, 1]ᵀ    0.8     0
[5, 8]ᵀ    0.6     1
⋮          ⋮       ⋮

J(W) = (1/n) Σ_{i=1}^{n} ℒ( f(x^(i); W), y^(i) )
                  (Predicted)    (Actual)

Also known as:
• Objective function
• Cost function
• Empirical Risk
Binary Cross Entropy Loss
Cross entropy loss can be used with models that output a probability between 0 and 1:

J(W) = −(1/n) Σ_{i=1}^{n} [ y^(i) log f(x^(i); W) + (1 − y^(i)) log( 1 − f(x^(i); W) ) ]
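A minimal sketch of this loss, using the predicted/actual values from the table above:

```python
import math

def binary_cross_entropy(y, p):
    # J(W) = -(1/n) * sum[ y*log(p) + (1-y)*log(1-p) ]
    n = len(y)
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p)) / n

y_true = [1, 0, 1]        # actual labels from the table
y_pred = [0.1, 0.8, 0.6]  # predicted probabilities from the table
loss = binary_cross_entropy(y_true, y_pred)
```

Note that confident wrong predictions (like 0.1 for a true label of 1) dominate the loss.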
Mean Squared Error Loss
Mean squared error loss can be used with regression models that output continuous real numbers (e.g. final grades as a percentage):

J(W) = (1/n) Σ_{i=1}^{n} ( y^(i) − f(x^(i); W) )²
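The corresponding sketch for MSE, using the predicted/actual grades from the slide (30/90, 80/20, 85/95):

```python
def mse(y, f):
    # J(W) = (1/n) * sum( (y_i - f_i)^2 )
    return sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / len(y)

loss = mse([90, 20, 95], [30, 80, 85])  # actual vs. predicted percentages
```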
Loss Optimization

We want to find the network weights that achieve the lowest loss:

W* = argmin_W (1/n) Σ_{i=1}^{n} ℒ( f(x^(i); W), y^(i) )
W* = argmin_W J(W)

Remember: W = { W^(0), W^(1), ⋯ }
Remember: our loss is a function of the network weights!

[Plot: loss surface J(w_0, w_1) over the weights w_0, w_1]
Loss Optimization

1. Randomly pick an initial (w_0, w_1)
2. Compute the gradient, ∂J(W)/∂W
3. Take a small step in the opposite direction of the gradient
4. Repeat until convergence — this is Gradient Descent

[Plots: loss surface J(w_0, w_1) with the descent steps marked]
Gradient Descent

Algorithm
1. Initialize weights randomly ~N(0, σ²)        weights = tf.random_normal(shape, stddev=sigma)
2. Loop until convergence:
3.   Compute gradient, ∂J(W)/∂W
4.   Update weights, W ← W − η ∂J(W)/∂W         weights_new = weights.assign(weights - lr * grads)
5. Return weights
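The algorithm above, sketched on a one-dimensional toy loss J(w) = (w − 3)², chosen so the gradient 2(w − 3) has a closed form (all values here are made up for illustration):

```python
import random

def gradient_descent(lr=0.1, steps=100, sigma=1.0):
    w = random.gauss(0.0, sigma)      # 1. initialize randomly ~ N(0, sigma^2)
    for _ in range(steps):            # 2. loop (fixed steps stand in for convergence)
        grad = 2.0 * (w - 3.0)        # 3. compute gradient dJ(w)/dw
        w = w - lr * grad             # 4. update w <- w - lr * grad
    return w                          # 5. return weights

w_star = gradient_descent()           # converges to the minimum at w = 3
```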
Computing Gradients: Backpropagation

x → z_1 → ŷ → J(W), with weight w_1 into z_1 and weight w_2 into ŷ.

How does a small change in one weight (e.g. w_2) affect the final loss J(W)? Apply the chain rule:

∂J(W)/∂w_2 = ∂J(W)/∂ŷ · ∂ŷ/∂w_2

∂J(W)/∂w_1 = ∂J(W)/∂ŷ · ∂ŷ/∂z_1 · ∂z_1/∂w_1

Repeat this for every weight in the network using gradients from later layers
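A worked numeric instance of the chain rule, on a hypothetical two-weight linear network (values made up):

```python
# z1 = w1 * x,  y_hat = w2 * z1,  J = (y_hat - y)^2
x, y = 2.0, 1.0
w1, w2 = 0.5, -1.0

# Forward pass
z1 = w1 * x            # 1.0
y_hat = w2 * z1        # -1.0
J = (y_hat - y) ** 2   # 4.0

# Backward pass: dJ/dw2 = dJ/dy_hat * dy_hat/dw2
dJ_dyhat = 2.0 * (y_hat - y)
dJ_dw2 = dJ_dyhat * z1
# dJ/dw1 = dJ/dy_hat * dy_hat/dz1 * dz1/dw1 -- the gradient flows back through z1
dJ_dw1 = dJ_dyhat * w2 * x
```

Checking by hand: dJ/dŷ = −4, so dJ/dw_2 = −4 · 1 = −4 and dJ/dw_1 = −4 · (−1) · 2 = 8.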
Neural Networks in Practice:
Optimization
Training Neural Networks is Difficult
Remember: optimization through gradient descent

W ← W − η ∂J(W)/∂W
Loss Functions Can Be Difficult to Optimize

Setting the Learning Rate

A small learning rate converges slowly and gets stuck in false local minima.

[Plots: loss J(W) vs. W from an initial guess, for different learning rates]
How to deal with this?

Idea 1: Try lots of different learning rates and see what works “just right”

Idea 2: Do something smarter! Design an adaptive learning rate that “adapts” to the landscape
Adaptive Learning Rates

• Adam      tf.train.AdamOptimizer      (Kingma et al., “Adam: A Method for Stochastic Optimization,” 2014)
• RMSProp   tf.train.RMSPropOptimizer
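To illustrate the idea (not the exact Adam/RMSProp implementations), here is a minimal RMSProp-style update with assumed hyperparameters: the step is normalized by a running average of squared gradients, so the effective learning rate adapts to the landscape:

```python
import math

def rmsprop_step(w, grad, s, lr=0.01, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * grad ** 2      # running average of grad^2
    w = w - lr * grad / (math.sqrt(s) + eps)   # normalized update
    return w, s

# Minimize the toy loss J(w) = (w - 3)^2, with dJ/dw = 2*(w - 3)
w, s = 0.0, 0.0
for _ in range(2000):
    w, s = rmsprop_step(w, 2.0 * (w - 3.0), s)
```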
Gradient Descent

Algorithm
1. Initialize weights randomly ~N(0, σ²)
2. Loop until convergence:
3.   Compute gradient, ∂J(W)/∂W
4.   Update weights, W ← W − η ∂J(W)/∂W
5. Return weights
Can be very computationally expensive to compute!
Stochastic Gradient Descent

Algorithm
1. Initialize weights randomly ~N(0, σ²)
2. Loop until convergence:
3.   Pick single data point i
4.   Compute gradient, ∂J_i(W)/∂W
5.   Update weights, W ← W − η ∂J_i(W)/∂W
6. Return weights
Easy to compute but very noisy (stochastic)!
Stochastic Gradient Descent

Algorithm
1. Initialize weights randomly ~N(0, σ²)
2. Loop until convergence:
3.   Pick batch of B data points
4.   Compute gradient, ∂J(W)/∂W = (1/B) Σ_{k=1}^{B} ∂J_k(W)/∂W
5.   Update weights, W ← W − η ∂J(W)/∂W
6. Return weights
Fast to compute and a much better estimate of the true gradient!
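A mini-batch SGD sketch on a toy 1-D regression problem (all data and hyperparameters below are made up): the gradient is estimated from B randomly sampled points instead of the full dataset.

```python
import random

# Toy linear model y_hat = w * x; the data satisfies y = 2x, so w should reach 2
random.seed(0)
data = [(x, 2.0 * x) for x in [random.uniform(-1, 1) for _ in range(100)]]

w, lr, B = 0.0, 0.1, 8
for _ in range(500):
    batch = random.sample(data, B)                       # pick batch of B points
    # dJ/dw averaged over the batch, where J = (1/B) * sum (w*x - y)^2
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / B
    w -= lr * grad                                       # update step
```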
Mini-batches while training
Regularization

What is it? A technique that constrains our optimization problem to discourage complex models.
Regularization 1: Dropout

• During training, randomly set some activations to 0
• Typically ‘drop’ 50% of activations in a layer      tf.keras.layers.Dropout(0.5)
• Forces network to not rely on any one node
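What dropout does to a layer's activations can be sketched in a few lines. This is inverted dropout (the variant Keras uses): survivors are scaled by 1/(1 − p) so the expected activation is unchanged, and the layer is a no-op at inference time.

```python
import random

def dropout(activations, p=0.5, training=True):
    # During training: zero each activation with probability p,
    # scale survivors by 1/(1-p). At test time: pass through unchanged.
    if not training:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if random.random() < keep else 0.0 for a in activations]

h = [0.2, 1.5, -0.7, 0.9]
h_train = dropout(h, p=0.5)          # roughly half the units zeroed
h_test = dropout(h, training=False)  # unchanged at inference
```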
Regularization 2: Early Stopping

• Stop training before we have a chance to overfit

[Plot: loss vs. training iterations; Legend: Testing, Training]
The ideal stopping point lies between the under-fitting regime (both losses still high) and the over-fitting regime (testing loss rises while training loss keeps falling).
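A patience-based early-stopping rule can be sketched as follows (the loss values are made up; in Keras this is handled by tf.keras.callbacks.EarlyStopping):

```python
def early_stop_index(val_losses, patience=2):
    # Stop once validation loss has not improved for `patience` checks;
    # return the index of the best (lowest-loss) checkpoint to keep.
    best, best_i, waited = float("inf"), 0, 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_i, waited = loss, i, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_i

# Validation loss falls, then rises as the model starts to overfit
val = [1.0, 0.7, 0.5, 0.45, 0.47, 0.55, 0.7]
stop = early_stop_index(val)   # -> 3, the minimum before over-fitting
```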
Core Foundation Review
[Diagrams: the perceptron and a deep neural network, revisited]
Questions?