CMPE 442 Introduction To Machine Learning: Artificial Neural Networks
Machine Learning
• Artificial Neural Networks
ANNs
If we can understand how the brain works, we could copy its mechanisms and use them in ML systems.
The brain is an impressively powerful and complicated system.
However, the basic building blocks of the brain are fairly simple and easy to understand.
Neurons are the processing units of the brain; there are roughly 100 billion of them.
Each neuron can be viewed as a separate processor performing a simple computation: deciding whether or not to fire.
Biological Neuron
Hebb’s Rule
Changes in the strength of a synaptic connection are proportional to the correlation in the firing of the two neurons it connects ("neurons that fire together, wire together").
McCulloch and Pitts Neurons
Inputs and output are binary: x_i ∈ {0,1}, y ∈ {0,1}
Example:
x_1 = 1; x_2 = 0; x_3 = 1
h = Σ_i w_i x_i        Output = g(h) = 1 if h ≥ θ, 0 if h < θ
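As a small worked instance of this rule (the weights and threshold below are assumed for illustration, they are not from the slide):
\[
w = (1, 1, 1),\ \theta = 2:\qquad h = 1\cdot 1 + 1\cdot 0 + 1\cdot 1 = 2 \ge \theta \;\Rightarrow\; g(h) = 1 \text{ (the neuron fires)}.
\]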
Some Boolean Functions
Output = g(h) = 1 if Σ_i w_i x_i − θ ≥ 0, 0 if Σ_i w_i x_i − θ < 0
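For example, with two binary inputs the following weight settings (an assumed choice, not from the slide) realise OR and AND:
\[
\text{OR: } w_1 = w_2 = 1,\ \theta = 1 \ (\text{fires when } x_1 + x_2 \ge 1);\qquad
\text{AND: } w_1 = w_2 = 1,\ \theta = 2 \ (\text{fires only when } x_1 = x_2 = 1).
\]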
The Perceptron
h = Σ_i w_i x_i        Output = g(h) = 1 if h ≥ 0, 0 if h < 0
The threshold is absorbed as a bias: an extra input x_0 = 1 with weight w_0 = −θ
The Perceptron
Initialisation
- Set all of the weights w_ij to small (positive and negative) random numbers
Training
- For T iterations:
    For each input vector:
        o Compute the activation of each neuron j using activation function g:
          y_j = g(Σ_i w_ij x_i) = 1 if Σ_i w_ij x_i ≥ 0, 0 if Σ_i w_ij x_i < 0
        o Update each of the weights using the perceptron learning rule:
          w_ij ← w_ij − η (y_j − t_j) x_i, where t_j is the target output
Recall
- Compute the activation of each neuron j using: y_j = g(Σ_i w_ij x_i) = 1 if Σ_i w_ij x_i ≥ 0, 0 if Σ_i w_ij x_i < 0
Perceptron: Example
Logical OR
There are three weights: the bias weight w_0 and the input weights w_1 and w_2
Assume η = 0.25; the bias input is fixed at 1
Pick three small random numbers. Assume: w_0 = −0.05, w_1 = −0.02, w_2 = 0.02
Input (0,0): −0.05 × 1 + −0.02 × 0 + 0.02 × 0 = −0.05 < 0, so the output is 0; the target for OR(0,0) is 0, so no update is needed
Input (0,1): −0.05 × 1 + −0.02 × 0 + 0.02 × 1 = −0.03 < 0, so the output is 0, but the target for OR(0,1) is 1: the weights are updated, w_i ← w_i − η (y − t) x_i, giving w_0 = 0.2, w_1 = −0.02, w_2 = 0.27
The neuron fires if Σ_i w_i x_i ≥ 0; Σ_i w_i x_i = 0 lies on the decision boundary
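A minimal sketch of this example in Python/NumPy, assuming the update rule and initial weights above (variable names and the number of passes are illustrative):

import numpy as np

# Inputs with a bias column fixed at 1, and OR targets
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([0, 1, 1, 1], dtype=float)

w = np.array([-0.05, -0.02, 0.02])   # initial weights from the slide
eta = 0.25                           # learning rate

for _ in range(10):                  # a few passes over the data
    for x_i, t_i in zip(X, t):
        y = 1.0 if np.dot(w, x_i) >= 0 else 0.0   # threshold activation
        w -= eta * (y - t_i) * x_i                # perceptron learning rule

print(w)                          # learned weights
print((X @ w >= 0).astype(int))   # should print 0 1 1 1, i.e. logical OR

Since OR is linearly separable, a few passes are enough for the weights to stop changing.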
Perceptron
[Figure: the XOR function plotted in the (x1, x2) plane, with the four input points labelled c1–c4. No single straight line separates the two classes, so a single perceptron cannot represent XOR.]
Neural Networks
[Figure: the (x1, x2) plane for the XOR problem; combining several neurons into a network makes the classes separable.]
Neural Network (Classification)
Training set: {(x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m))}
z_1^(2) = Θ_10^(1) x_0 + Θ_11^(1) x_1 + Θ_12^(1) x_2 + Θ_13^(1) x_3
a_1^(2) = g(z_1^(2))
Neural Network (Classification)
Training set: {(x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m))}
The second-layer activations a^(2) (with bias a_0^(2) = 1) feed the next layer:
z_1^(3) = Θ_10^(2) a_0^(2) + Θ_11^(2) a_1^(2) + Θ_12^(2) a_2^(2) + Θ_13^(2) a_3^(2) + …
a_1^(3) = g(z_1^(3))
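A minimal forward-propagation sketch in Python/NumPy under these definitions (the layer sizes and names are illustrative assumptions, not from the slide):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Thetas):
    """Forward propagation: Thetas[l] maps layer l+1 to layer l+2 and
    includes the bias column; x is a single input vector."""
    a = x
    for Theta in Thetas:
        a = np.concatenate(([1.0], a))   # prepend the bias unit a_0 = 1
        z = Theta @ a                    # z^(l+1) = Theta^(l) a^(l)
        a = sigmoid(z)                   # a^(l+1) = g(z^(l+1))
    return a                             # network output h_Theta(x)

# Example with assumed layer sizes 3 -> 4 -> 2
rng = np.random.default_rng(0)
Thetas = [rng.normal(scale=0.1, size=(4, 4)),   # 4 hidden units x (3 inputs + bias)
          rng.normal(scale=0.1, size=(2, 5))]   # 2 outputs x (4 hidden units + bias)
print(forward(np.array([1.0, 0.0, 1.0]), Thetas))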
Neural Network (Classification)
y ∈ ℝ^K, K output units
Logistic Regression: Cost Function
Generalised to a network with K output units:
h_Θ(x) ∈ ℝ^K, (h_Θ(x))_k = the k-th output
J(Θ) = −(1/m) Σ_{i=1..m} Σ_{k=1..K} [ y_k^(i) log (h_Θ(x^(i)))_k + (1 − y_k^(i)) log (1 − (h_Θ(x^(i)))_k) ]
       + (λ/2m) Σ_l Σ_i Σ_j (Θ_ji^(l))^2
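A sketch of this cost in Python/NumPy, assuming Y is an m×K matrix of one-hot targets and H is the m×K matrix of network outputs (these names and the function are illustrative, not from the slide):

import numpy as np

def nn_cost(H, Y, Thetas, lam):
    """Regularised cross-entropy cost J(Theta).
    H: (m, K) outputs h_Theta(x^(i)); Y: (m, K) one-hot targets."""
    m = Y.shape[0]
    data_term = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # The regularisation term skips the bias column (column 0) of each Theta^(l)
    reg_term = lam / (2 * m) * sum(np.sum(Theta[:, 1:] ** 2) for Theta in Thetas)
    return data_term + reg_term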
Neural Network (Learning)
[Figure: forward propagation — the inputs x_1, …, x_4 feed forward through the hidden activations a^(2) and a^(3) to the output activations a^(4).]
Neural Network (Learning)
[Figure: backward propagation — the output errors a^(4) − y are passed back through the network, producing error terms δ^(4), δ^(3), δ^(2) for the corresponding layers.]
δ^(4) = a^(4) − y
δ^(3) = (Θ^(3))^T δ^(4) .* g′(z^(3)),   where g′(z^(3)) = a^(3) .* (1 − a^(3))
δ^(2) = (Θ^(2))^T δ^(3) .* g′(z^(2)),   where g′(z^(2)) = a^(2) .* (1 − a^(2))
∂J(Θ)/∂Θ_ij^(l) = a_j^(l) δ_i^(l+1)   (ignoring regularisation)
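The identity g′(z) = a .* (1 − a) follows from the sigmoid activation; a short derivation (standard calculus, added here for completeness):
\[
g(z) = \frac{1}{1+e^{-z}}, \qquad
g'(z) = \frac{e^{-z}}{(1+e^{-z})^{2}} = g(z)\bigl(1 - g(z)\bigr) = a\,(1-a).
\]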
Algorithm
Initialization
- Initialize all the weights to small (positive and negative) random values
Training
Repeat
- Given the training set {(x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m))}
- Set Δ_ij^(l) := 0 for all l, i, j
- For k = 1 to m
    Forward Propagation:
    - Compute the activation of each neuron j in each layer, a_j^(l), for l = 2, 3, …, L
    Backward Propagation:
    - Compute the error at the output: δ^(L) = a^(L) − y^(k)
    - Compute δ^(L−1), δ^(L−2), …, δ^(2)
    - Δ_ij^(l) := Δ_ij^(l) + a_j^(l) δ_i^(l+1)
- D_ij^(l) := (1/m) (Δ_ij^(l) + λ Θ_ij^(l)) if j ≠ 0
- D_ij^(l) := (1/m) Δ_ij^(l) if j = 0 (j = 0 is the bias term)   ( ∂J(Θ)/∂Θ_ij^(l) = D_ij^(l) )
- Θ_ij^(l) := Θ_ij^(l) − η D_ij^(l)
Until learning stops
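A compact sketch of this algorithm in Python/NumPy for a network with one hidden layer and sigmoid activations (layer sizes, names, and the fixed number of epochs are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, hidden=5, eta=0.25, lam=0.0, epochs=1000, seed=0):
    """Batch backpropagation for a network with one hidden layer.
    X: (m, n) inputs; Y: (m, K) targets."""
    m, n = X.shape
    K = Y.shape[1]
    rng = np.random.default_rng(seed)
    Theta1 = rng.normal(scale=0.1, size=(hidden, n + 1))   # hidden x (inputs + bias)
    Theta2 = rng.normal(scale=0.1, size=(K, hidden + 1))   # outputs x (hidden + bias)
    for _ in range(epochs):
        Delta1 = np.zeros_like(Theta1)
        Delta2 = np.zeros_like(Theta2)
        for x, y in zip(X, Y):
            # Forward propagation
            a1 = np.concatenate(([1.0], x))
            a2 = np.concatenate(([1.0], sigmoid(Theta1 @ a1)))
            a3 = sigmoid(Theta2 @ a2)
            # Backward propagation
            d3 = a3 - y                                      # delta at the output
            d2 = (Theta2.T @ d3)[1:] * a2[1:] * (1 - a2[1:]) # delta at the hidden layer
            Delta2 += np.outer(d3, a2)
            Delta1 += np.outer(d2, a1)
        # Gradients with regularisation (bias column not regularised)
        D1 = Delta1 / m; D1[:, 1:] += lam * Theta1[:, 1:] / m
        D2 = Delta2 / m; D2[:, 1:] += lam * Theta2[:, 1:] / m
        Theta1 -= eta * D1
        Theta2 -= eta * D2
    return Theta1, Theta2

The per-example inner loop mirrors the algorithm as written on the slide; in practice the forward and backward passes are usually vectorised over the whole batch.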
Recall
- Use forward propagation (as in the training phase) to compute the network outputs
Expressive Capabilities of ANNs
Boolean functions:
Every Boolean function can be represented by a network with a single hidden layer,
but this might require a number of hidden units that is exponential in the number of inputs.
Continuous functions:
Every bounded continuous function can be approximated with arbitrarily small
error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989].
Any function can be approximated to arbitrary accuracy by a network with two
hidden layers [Cybenko 1988].
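As a tiny illustration of the Boolean-function claim, a hand-built network with one hidden layer that computes XOR; the specific weights below are an assumed construction (OR and AND hidden units combined at the output), not taken from the slides:

import numpy as np

def step(z):
    return (z >= 0).astype(float)   # threshold activation

# Hidden units: h1 = OR(x1, x2), h2 = AND(x1, x2); output fires for h1 AND NOT h2
Theta1 = np.array([[-0.5, 1.0, 1.0],    # fires if x1 + x2 >= 0.5  (OR)
                   [-1.5, 1.0, 1.0]])   # fires if x1 + x2 >= 1.5  (AND)
Theta2 = np.array([[-0.5, 1.0, -1.0]])  # fires if h1 - h2 >= 0.5

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = step(Theta1 @ np.array([1.0, x1, x2]))      # hidden layer (bias input first)
    y = step(Theta2 @ np.concatenate(([1.0], h)))   # output layer
    print((x1, x2), int(y[0]))                      # prints the XOR truth table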
Practical Issues