
CMPE 442 Introduction to Machine Learning
• Artificial Neural Networks
ANNs

 Were first introduced in 1943 by the neurophysiologist Warren McCulloch and the mathematician Walter Pitts.
 They presented a simplified computational model of how biological neurons might work together in animal brains to perform complex computations using propositional logic.
 In the 1960s, ANNs entered a dark era as their limitations became apparent.
 In the 1980s, a new wave of interest in ANNs arrived with better training techniques.
 In the 1990s, other powerful techniques such as SVMs were favored.
 Now we are witnessing a new wave of interest in ANNs.
 Is this time different?
The Brain and the Neuron

 If we can understand how the brain works, we could copy it and use it for ML systems.
 The brain is an impressively powerful and complicated system.
 However, the basic building blocks of the brain are fairly simple and easy to understand.
 Neurons are the processing units of the brain.
 There are roughly 100 billion of them.
 Each neuron can be viewed as a separate processor performing a simple computation: deciding whether or not to fire.
Biological Neuron
Hebb’s Rule

 If two neurons consistently fire simultaneously, then any connection between them will change in strength and become stronger.
 The idea: if two neurons both respond to something, then they should be connected.
 Pavlov used this idea in classical conditioning:
 food was shown and a bell was rung at the same time; the neurons for salivating over the food and for hearing the bell fired simultaneously, and so became strongly connected.
McCulloch and Pitts Neurons

 Mathematical model of neurons
 McCulloch and Pitts modelled a neuron as:
1) A set of inputs 𝑥ᵢ that correspond to the synapses
2) An adder that sums the input signals
3) A threshold function
McCulloch and Pitts Neurons

 Inputs 𝑥ᵢ ∈ {0,1} and output ∈ {0,1}
 All weights are 1s: equal weight for every input
McCulloch and Pitts Neurons
Example:
𝑥₁ = 1; 𝑥₂ = 0; 𝑥₃ = 1

• In real neurons those inputs come from the outputs of other neurons through their synapses
• Those synapses have strengths, called weights
• The strength of the synapse affects the strength of the signal
McCulloch and Pitts Neurons
Example:
𝑥₁ = 1; 𝑥₂ = 0; 𝑥₃ = 1

ℎ = ∑ᵢ 𝑥ᵢ        Output = 𝑔(ℎ) = 1 if ℎ ≥ 𝜃; 0 if ℎ < 𝜃
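The adder-plus-threshold behaviour above can be sketched in a few lines; this is an illustrative implementation, not code from the slides, and the function name is a made-up label.

```python
# A minimal sketch of a McCulloch-Pitts neuron: it sums its binary
# inputs (all weights are 1) and fires if the sum reaches the
# threshold theta.

def mcculloch_pitts(inputs, theta):
    """Return 1 if the sum of the binary inputs reaches threshold theta."""
    h = sum(inputs)              # the "adder" sums the input signals
    return 1 if h >= theta else 0

# The slide's example: x1 = 1, x2 = 0, x3 = 1, so h = 2.
output = mcculloch_pitts([1, 0, 1], theta=2)   # fires, since 2 >= 2
```

With theta = 2 the neuron fires on this input; raising theta to 3 would silence it, which is exactly why the threshold has to be picked to match the function you want.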
Some Boolean Functions

Can any Boolean function be represented using a McCulloch-Pitts unit?
McCulloch and Pitts Neurons

 The McCulloch and Pitts neuron is a binary threshold device (0/1)
 It can be used to represent Boolean functions which are linearly separable
 A very simple model
 The threshold 𝜃 must be picked by hand
 It is not learnable: there is no training rule
 It does not handle non-Boolean inputs
 All inputs are given the same weight
The Perceptron

 A more general computational model
 Inputs and outputs are now numbers
 One of the simplest ANN architectures
 Uses slightly different artificial neurons: linear threshold units (LTUs)
 Each input has its own weight
The Perceptron

Output = 𝑔(ℎ) = 1 if ∑ᵢ 𝑤ᵢ𝑥ᵢ − 𝜃 ≥ 0; 0 if ∑ᵢ 𝑤ᵢ𝑥ᵢ − 𝜃 < 0
The Perceptron

ℎ = ∑ᵢ₌₀ 𝑤ᵢ𝑥ᵢ        Output = 𝑔(ℎ) = 1 if ℎ ≥ 0; 0 if ℎ < 0

The threshold is absorbed as a bias weight: 𝑥₀ = 1 and 𝑤₀ = −𝜃
The Perceptron

 Neurons in the Perceptron are completely independent of each other: it doesn’t matter to any neuron what the others are doing
 The only thing they share is the input
 The number of inputs equals the number of features
 The number of inputs does not have to equal the number of neurons: in general there will be 𝑚 inputs and 𝑛 neurons
The Perceptron Algorithm

 Initialisation
- Set all of the weights 𝑤ᵢⱼ to small (positive and negative) random numbers
 Training
- For T iterations:
 For each input vector:
o Compute the activation of each neuron 𝑗 using activation function 𝑔:
𝑦ⱼ = 𝑔(∑ᵢ 𝑤ᵢⱼ𝑥ᵢ) = 1 if ∑ᵢ 𝑤ᵢⱼ𝑥ᵢ ≥ 0; 0 if ∑ᵢ 𝑤ᵢⱼ𝑥ᵢ < 0
o Update each of the weights individually using:
𝑤ᵢⱼ ← 𝑤ᵢⱼ + 𝜂(𝑡ⱼ − 𝑦ⱼ)𝑥ᵢ
 Recall
- Compute the activation of each neuron 𝑗 using: 𝑦ⱼ = 𝑔(∑ᵢ 𝑤ᵢⱼ𝑥ᵢ) = 1 if ∑ᵢ 𝑤ᵢⱼ𝑥ᵢ ≥ 0; 0 if ∑ᵢ 𝑤ᵢⱼ𝑥ᵢ < 0
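The algorithm above can be sketched directly for a single neuron; this is a minimal illustration (function names, the seed, and the initialisation range are illustrative choices, not from the slides), trained here on logical OR.

```python
import random

# A sketch of the Perceptron algorithm for one neuron, following the
# slide's notation: eta is the learning rate, T the iteration count,
# and x0 = 1 is the bias input.

def train_perceptron(X, t, eta=0.25, T=20, seed=0):
    """X: list of input vectors (without bias), t: list of 0/1 targets."""
    rng = random.Random(seed)
    n = len(X[0]) + 1                      # +1 for the bias input x0 = 1
    w = [rng.uniform(-0.05, 0.05) for _ in range(n)]   # small random weights
    for _ in range(T):                     # T iterations over the data
        for x, target in zip(X, t):
            xb = [1] + list(x)             # prepend the bias input
            h = sum(wi * xi for wi, xi in zip(w, xb))
            y = 1 if h >= 0 else 0         # threshold activation g
            # update rule: w <- w + eta * (t - y) * x
            w = [wi + eta * (target - y) * xi for wi, xi in zip(w, xb)]
    return w

def recall(w, x):
    """Recall phase: compute the activation with the learned weights."""
    xb = [1] + list(x)
    return 1 if sum(wi * xi for wi, xi in zip(w, xb)) >= 0 else 0

# Learn logical OR, which is linearly separable.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
t = [0, 1, 1, 1]
w = train_perceptron(X, t)
```

Because OR is linearly separable, the perceptron convergence theorem guarantees the weights stop changing after a finite number of mistakes, so a small T suffices here.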
Perceptron: Example

 Logical OR
 There are three weights
 Pick three small random numbers. Assume: 𝑤₀ = −0.05, 𝑤₁ = −0.02, 𝑤₂ = 0.02
Perceptron: Example

 Logical OR
 There are three weights
 Assume η = 0.25
 The bias input is 1
 Pick three small random numbers. Assume: 𝑤₀ = −0.05, 𝑤₁ = −0.02, 𝑤₂ = 0.02
 Input (0,0):
−0.05 × 1 + (−0.02) × 0 + 0.02 × 0 = −0.05
Perceptron: Example

 Logical OR, η = 0.25, bias input 1
 𝑤₀ = −0.05, 𝑤₁ = −0.02, 𝑤₂ = 0.02
 Input (0,0), target 0: activation −0.05 < 0, so the output is 0 and the updates leave the weights unchanged:

𝑤₀: −0.05 + 0.25 × (0 − 0) × 1 = −0.05
𝑤₁: −0.02 + 0.25 × (0 − 0) × 0 = −0.02
𝑤₂: 0.02 + 0.25 × (0 − 0) × 0 = 0.02
Perceptron: Example

 η = 0.25
 𝑤₀ = −0.05, 𝑤₁ = −0.02, 𝑤₂ = 0.02
 Input (0,1), target 1: −0.05 × 1 + (−0.02) × 0 + 0.02 × 1 = −0.03, so the output is 0 and the weights are updated:

𝑤₀: −0.05 + 0.25 × (1 − 0) × 1 = 0.2
𝑤₁: −0.02 + 0.25 × (1 − 0) × 0 = −0.02
𝑤₂: 0.02 + 0.25 × (1 − 0) × 1 = 0.27
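The two hand-worked update steps above can be checked in code; this is a small illustrative script (the `step` helper is a made-up name), reproducing the slide's numbers.

```python
eta = 0.25
w = [-0.05, -0.02, 0.02]        # [w0 (bias), w1, w2] from the slide

def step(w, x):
    """Activation h and thresholded output for input x (x includes bias x0 = 1)."""
    h = sum(wi * xi for wi, xi in zip(w, x))
    return h, (1 if h >= 0 else 0)

# Input (0,0), target 0: h = -0.05, output 0 -> the update adds zero.
h, y = step(w, [1, 0, 0])
w = [wi + eta * (0 - y) * xi for wi, xi in zip(w, [1, 0, 0])]

# Input (0,1), target 1: h = -0.03, output 0 -> the weights move.
h, y = step(w, [1, 0, 1])
w = [wi + eta * (1 - y) * xi for wi, xi in zip(w, [1, 0, 1])]
```

After these two presentations the weights match the slide: w0 = 0.2, w1 = −0.02, w2 = 0.27.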
Perceptron

 What does the Perceptron really compute?
 It tries to find a straight line (a plane in 3D, a hyperplane in higher dimensions) such that the neuron fires on one side of the line and does not fire on the other.
 So this is simply a decision boundary that the Perceptron tries to find.

If ∑ᵢ 𝑤ᵢ𝑥ᵢ ≥ 0, the neuron fires; ∑ᵢ 𝑤ᵢ𝑥ᵢ = 0 is the decision boundary.
Perceptron

 What happens when we have more than one neuron?
 The weights for each neuron separately describe a straight line
 When we put together several neurons we get several straight lines
 Each straight line tries to separate a different part of the space

(figure: several lines partitioning the plane into regions for classes c1–c4)
XOR function

 In 1969 Marvin Minsky and Seymour Papert demonstrated that the Perceptron is incapable of solving the XOR function
 This is true for any linear classification model (e.g. Logistic Regression)
 This discovery was a major reason for the dark era of ANNs.
Multi-Layer Perceptron

 The majority of interesting problems are not linearly separable
 We saw before that problems can be made linearly separable if we increase the dimensionality of the data
 In ANNs we can instead make the network more complicated
 Learning in ANNs happens in the weights
 Adding more neurons between the input nodes and the outputs gives more complex neural networks
MLP: XOR function

 Some of the limitations of the Perceptron can be eliminated by stacking multiple Perceptrons  the Multi-Layer Perceptron
MLP

 An MLP is composed of:
 One input layer
 One or more layers of LTUs, called hidden layers
 One output layer
 Every layer except the output layer includes a bias term
 Every layer is fully connected to the next layer
 When an ANN has two or more hidden layers, it is called a Deep Neural Network (DNN)
MLP
MLP Training

 Training goes in two steps: a forward phase and a backward phase
 For each training instance, feed it to the network and compute the output
 Measure the network’s output error
 Compute how much each neuron in the last hidden layer contributed to each output neuron’s error
 Then compute how much of these error contributions came from each neuron in the previous layer
 This continues until the input layer is reached
 The algorithm propagates the error gradient backward through the network
 The last step of the backpropagation algorithm is a Gradient Descent step on all connection weights in the network
Logistic Units

 Since the step function is not differentiable, in the MLP it is replaced by the logistic (sigmoid) function 𝑔(𝑧) = 1/(1 + 𝑒^(−𝑧))
Neural Networks

 By using Neural Networks we can get highly complex nonlinear functions

(figure: nonlinear decision boundaries in the (x1, x2) plane)
Neural Network (Classification)

Layer 1 Layer 2 Layer 3 Layer 4

Learning happens in the strengths of the connections (weights) between neurons
Neural Network (Classification)

Training set: {(𝑥^(1), 𝑦^(1)), (𝑥^(2), 𝑦^(2)), …, (𝑥^(m), 𝑦^(m))}

𝐿 = total number of layers
𝑠_𝑙 = number of units (neurons) in layer 𝑙, not counting the bias unit

Layer 1 Layer 2 Layer 3 Layer 4
Neural Network (Classification)

𝑎_𝑖^(𝑗) = activation of unit 𝑖 in layer 𝑗
𝜃^(𝑗) = matrix of weights controlling the function mapping from layer 𝑗 to layer 𝑗 + 1

𝑧_1^(2) = 𝜃_10^(1)𝑥_0 + 𝜃_11^(1)𝑥_1 + 𝜃_12^(1)𝑥_2 + 𝜃_13^(1)𝑥_3
𝑎_1^(2) = 𝑔(𝑧_1^(2))
Neural Network (Classification)

For the next layer, the activations of layer 2 play the role of the inputs:

𝑧_1^(3) = 𝜃_10^(2)𝑎_0^(2) + 𝜃_11^(2)𝑎_1^(2) + 𝜃_12^(2)𝑎_2^(2) + ⋯ + 𝜃_1𝑠^(2)𝑎_𝑠^(2)
𝑎_1^(3) = 𝑔(𝑧_1^(3))
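The layer-by-layer recurrence above is easy to vectorize; the sketch below is an illustrative implementation (the layer sizes, seed, and weight range are my assumptions, not values from the slides), computing a^(l+1) = g(Theta^(l) a^(l)) for a four-layer network.

```python
import numpy as np

# Forward propagation through a 4-layer network. g is the logistic
# function; each Theta^(l) maps layer l (plus its bias unit) to
# layer l+1, so it has shape s_{l+1} x (s_l + 1).

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [3, 5, 5, 4]                        # s_l for layers 1..4, without bias
Thetas = [rng.uniform(-0.1, 0.1, (sizes[l + 1], sizes[l] + 1))
          for l in range(len(sizes) - 1)]

def forward(x, Thetas):
    a = np.asarray(x, dtype=float)
    for Theta in Thetas:
        a = np.concatenate(([1.0], a))      # prepend the bias unit a0 = 1
        a = g(Theta @ a)                    # a^(l+1) = g(z^(l+1)) = g(Theta^(l) a^(l))
    return a

output = forward([1.0, 0.0, 1.0], Thetas)   # one output per class (K = 4 here)
```

Each loop iteration is exactly one of the z/a equations above, applied to a whole layer at once via a matrix-vector product.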
Neural Network (Classification)

Layer 1 Layer 2 Layer 3 Layer 4

Binary Classification: 𝐾 = 2
If 𝑦 = 0 or 1, then only one output unit is needed
Neural Network (Classification)

Layer 1 Layer 2 Layer 3 Layer 4

Multi-Class Classification: 𝐾 ≥ 3
𝑦 ∈ ℝ^𝐾, 𝐾 output units
Logistic Regression: Cost function

 𝐽(𝜃) = −(1/𝑚) ∑_{𝑖=1}^{𝑚} [𝑦^(𝑖) log ℎ_𝜃(𝑥^(𝑖)) + (1 − 𝑦^(𝑖)) log(1 − ℎ_𝜃(𝑥^(𝑖)))] + (λ/2𝑚) ∑_{𝑗=1}^{𝑛} 𝜃_𝑗²
Neural Network: Cost Function

 ℎ_𝜃(𝑥) ∈ ℝ^𝐾, (ℎ_𝜃(𝑥))_𝑘 = 𝑘-th output

𝐽(𝜃) = −(1/𝑚) ∑_{𝑖=1}^{𝑚} ∑_{𝑘=1}^{𝐾} [𝑦_𝑘^(𝑖) log(ℎ_𝜃(𝑥^(𝑖)))_𝑘 + (1 − 𝑦_𝑘^(𝑖)) log(1 − (ℎ_𝜃(𝑥^(𝑖)))_𝑘)] + (λ/2𝑚) ∑_{𝑙=1}^{𝐿−1} ∑_{𝑖} ∑_{𝑗} (𝜃_{𝑗𝑖}^{(𝑙)})²
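The cost above is just a double sum over examples and output units plus a regularizer over non-bias weights; the sketch below is an illustrative implementation (names `H`, `Y`, `Thetas` and the convention that column 0 of each weight matrix holds the bias weights are my assumptions).

```python
import numpy as np

# Regularized cross-entropy cost. H is an m x K matrix of network
# outputs h(x^(i)), Y the matching 0/1 target matrix, Thetas the
# weight matrices; the regularizer skips the bias column.

def nn_cost(H, Y, Thetas, lam):
    m = Y.shape[0]
    data_term = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    reg_term = lam / (2 * m) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return data_term + reg_term

H = np.array([[0.9, 0.1], [0.2, 0.8]])   # outputs for m = 2, K = 2
Y = np.array([[1, 0], [0, 1]])           # one-hot targets
Thetas = [np.array([[0.5, 1.0, -1.0]])]  # toy weight matrix
cost = nn_cost(H, Y, Thetas, lam=0.0)
```

With lam = 0 this reduces to plain cross-entropy; a positive lam only ever adds to the cost, shrinking the non-bias weights when minimized.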
Neural Network (Learning)

(figure: four-layer network with inputs 𝑥_1…𝑥_4 and activations 𝑎^(2), 𝑎^(3), 𝑎^(4))
Neural Network (Learning)

 For each training sample 𝑖:
Forward Propagation
Neural Network (Learning)

(figure: errors 𝑎^(4) − 𝑦^(𝑖) at the output units, propagated backward as 𝛿^(4), 𝛿^(3), 𝛿^(2))

 For each training sample 𝑖:
Backpropagation
Neural Network (Learning)

 𝛿_𝑗^(𝑙) = error of node 𝑗 in layer 𝑙

𝛿_𝑗^(4) = 𝑦_𝑗 − 𝑎_𝑗^(4)        (in vector form: 𝛿^(4) = 𝑦 − 𝑎^(4))
𝛿^(3) = (𝜃^(3))^𝑇 𝛿^(4) .∗ 𝑔′(𝑧^(3)), where 𝑔′(𝑧^(3)) = 𝑎^(3) .∗ (1 − 𝑎^(3))
𝛿^(2) = (𝜃^(2))^𝑇 𝛿^(3) .∗ 𝑔′(𝑧^(2)), where 𝑔′(𝑧^(2)) = 𝑎^(2) .∗ (1 − 𝑎^(2))
(there is no 𝛿^(1): the input layer has no error)

∂𝐽(𝜃)/∂𝜃_{𝑖𝑗}^{(𝑙)} = 𝑎_𝑗^(𝑙) 𝛿_𝑖^(𝑙+1)   (ignoring regularization)
Algorithm
 Initialization
- Initialize all the weights to small (positive and negative) random values
 Training
 Repeat
- For the training set {(𝑥^(1), 𝑦^(1)), (𝑥^(2), 𝑦^(2)), …, (𝑥^(𝑚), 𝑦^(𝑚))}
- Set ∆_{𝑖𝑗}^{(𝑙)} := 0 for all 𝑙, 𝑖, 𝑗
- For 𝑘 = 1 to 𝑚
Forward Propagation:
- Compute the activation of each neuron 𝑗 in the hidden layers: 𝑎^(𝑙) for 𝑙 = 2, 3, …, 𝐿
Backward Propagation:
- Compute the error at the output: 𝛿^(𝐿) = 𝑦^(𝑘) − 𝑎^(𝐿)
- Compute 𝛿^(𝐿−1), 𝛿^(𝐿−2), …, 𝛿^(2)
- ∆_{𝑖𝑗}^{(𝑙)} := ∆_{𝑖𝑗}^{(𝑙)} + 𝑎_𝑗^(𝑙) 𝛿_𝑖^(𝑙+1)
- 𝐷_{𝑖𝑗}^{(𝑙)} := (1/𝑚)∆_{𝑖𝑗}^{(𝑙)} + λ𝜃_{𝑖𝑗}^{(𝑙)} if 𝑗 ≠ 0
- 𝐷_{𝑖𝑗}^{(𝑙)} := (1/𝑚)∆_{𝑖𝑗}^{(𝑙)} if 𝑗 = 0 (𝑗 = 0 is the bias term)
- 𝜃_{𝑖𝑗}^{(𝑙)} := 𝜃_{𝑖𝑗}^{(𝑙)} + 𝜂𝐷_{𝑖𝑗}^{(𝑙)} (with the 𝛿 = 𝑦 − 𝑎 convention, 𝐷 points downhill in 𝐽, so the step is added)
 Until learning stops
 Recall
- Use Forward Propagation, as in the training phase
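The whole algorithm can be sketched end-to-end for one hidden layer; the code below is an illustrative implementation trained on XOR (the layer sizes, learning rate, seed, iteration count, and lambda = 0 are my assumptions), following the slides' conventions delta^(L) = y − a^(L) and theta := theta + eta·D.

```python
import numpy as np

def g(z):                                  # logistic activation
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0.0], [1.0], [1.0], [0.0]])
m = len(X)

Theta1 = rng.uniform(-0.5, 0.5, (4, 3))    # 2 inputs + bias -> 4 hidden units
Theta2 = rng.uniform(-0.5, 0.5, (1, 5))    # 4 hidden + bias -> 1 output
eta = 0.5

def forward(x):
    a1 = np.concatenate(([1.0], x))        # input layer with bias
    a2 = np.concatenate(([1.0], g(Theta1 @ a1)))
    a3 = g(Theta2 @ a2)                    # output layer
    return a1, a2, a3

def cost():                                # cross-entropy over the training set
    h = np.array([forward(x)[2][0] for x in X]).reshape(-1, 1)
    return float(-np.mean(Y * np.log(h) + (1 - Y) * np.log(1 - h)))

initial_cost = cost()
for _ in range(5000):                      # "repeat until learning stops"
    D1, D2 = np.zeros_like(Theta1), np.zeros_like(Theta2)
    for x, y in zip(X, Y):
        a1, a2, a3 = forward(x)
        delta3 = y - a3                    # error at the output
        # delta^(2) = (Theta^(2))^T delta^(3) .* a^(2) .* (1 - a^(2))
        delta2 = (Theta2.T @ delta3) * a2 * (1 - a2)
        D2 += np.outer(delta3, a2)         # accumulate Delta terms
        D1 += np.outer(delta2[1:], a1)     # drop the bias unit's delta
    Theta1 += eta * D1 / m                 # gradient step (lambda = 0 here)
    Theta2 += eta * D2 / m
final_cost = cost()
```

The inner loop is exactly the per-sample forward pass, delta computation, and Delta accumulation from the algorithm; the weight update is applied once per pass over the data (batch learning).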
Expressive Capabilities of ANNs

 Boolean functions:
 Every Boolean function can be represented by a network with a single hidden layer
 But it might require an exponential (in the number of inputs) number of hidden units
 Continuous functions:
 Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
 Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Practical Issues

 Regression: replace the output neurons with linear nodes 𝑔(𝑧) = 𝑧
 Weights: should be initialized to small random numbers, both positive and negative.
 A common trick is to set the weights in the range −1/√𝑛 < 𝜃 < 1/√𝑛, where 𝑛 is the number of nodes in the input layer; this is called uniform learning.
 Keep the initial weights all about the same size so that all of the weights reach their final values at about the same time.
 Batch/Sequential learning: the decision whether to update the weights after looping over all training examples or after backpropagating each training example
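The initialization trick above is a one-liner in practice; this is an illustrative sketch (the function name, sizes, and seed are my choices), drawing each weight uniformly from (−1/√n, 1/√n).

```python
import numpy as np

# Initialize one layer's weight matrix with the -1/sqrt(n)..1/sqrt(n)
# rule, where n is the number of nodes feeding the weights.

def init_layer(n_in, n_out, seed=0):
    rng = np.random.default_rng(seed)
    bound = 1.0 / np.sqrt(n_in)
    # one extra column for the bias input
    return rng.uniform(-bound, bound, size=(n_out, n_in + 1))

W = init_layer(n_in=100, n_out=20)         # every entry in [-0.1, 0.1)
```

Scaling by 1/√n keeps the summed input to each unit of roughly constant size regardless of how many inputs it has, so no layer starts saturated.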
Practical Issues

 Number of hidden layers:
 A reasonable default is 1 hidden layer.
 If using more than one hidden layer, then use the same number of units in every layer (the more the better)
 Number of units in each layer:
 There is no theory to guide this choice
 The number of hidden units in each layer should preferably be more than the number of input features
 Training set size: a rule of thumb is to use at least 10 times as many training examples as there are weights
Overfitting in ANNs
Dealing with Overfitting

 The Gradient Descent algorithm involves a parameter 𝑛, the number of gradient descent iterations
 How do we choose 𝑛 to optimize future error?
 e.g. the 𝑛 that minimizes the error rate of the neural net over future data
Dealing with Overfitting

 The Gradient Descent algorithm involves a parameter 𝑛, the number of gradient descent iterations
 How do we choose 𝑛 to optimize future error?
 Separate the available data into a training set and a validation set
 Use the training set to perform gradient descent
 𝑛 ← the number of iterations that optimizes the validation set error

→ this gives an unbiased estimate of the optimal 𝑛
(but a biased estimate of the true error)
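The procedure above (early stopping) can be sketched generically; `choose_n`, `train_step`, and `val_error` are made-up stand-ins for a real model's update and evaluation routines, and the toy error sequence below just mimics validation error falling and then rising as the net overfits.

```python
def choose_n(train_step, val_error, max_iters):
    """Run up to max_iters gradient-descent iterations; return the
    iteration count n with the lowest validation error seen."""
    best_n, best_err = 0, float("inf")
    for n in range(1, max_iters + 1):
        train_step()                 # one gradient-descent iteration
        err = val_error()            # evaluate on the held-out set
        if err < best_err:
            best_n, best_err = n, err
    return best_n, best_err

# Toy stand-in: validation error falls, then rises (overfitting).
errors = [0.9, 0.6, 0.4, 0.3, 0.35, 0.5]
it = iter(errors)
state = {"e": None}
def train_step(): state["e"] = next(it)
def val_error(): return state["e"]

best_n, best_err = choose_n(train_step, val_error, len(errors))
```

Here the validation error bottoms out at iteration 4, so training would be stopped there even though more iterations would keep lowering the training error.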
ANNs for Face Recognition
ANNs for Autonomous Vehicle Driving
