Tutorial on Neural Networks_18MAR2024
Mohammad Mundiwala
March 18 2024
Overview
Part 1: Forward Propagation
a) From input to output, by the math
b) Implementing it in Python
• Neuron or Node
  i. Linear function: z = w·x + b
  ii. Activation function: a(z)
Let’s Discuss Shapes
Shape of Weights?
- Example: a 4×4-pixel image is vectorized into a 16×1 input vector X = [x_1, …, x_n]^T (here n = 16)
- For a single neuron: z = w·x + b
- Layer 1: Z^[1] = W^[1] · X + B^[1]
- Layer k: Z^[k] = W^[k] · A^[k-1] + B^[k]
Generalizing the weight shapes:
- W^[1] is m^[1] × n
- W^[2] is m^[2] × m^[1]
- W^[k] is m^[k] × m^[k-1]
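To make the shapes concrete, here is a minimal NumPy sketch; the layer size m^[1] = 3 is illustrative, only the 4×4-pixel, 16×1 input comes from the slide.

import numpy as np

# a 4x4 "image" vectorized into a 16x1 column vector X
X = np.random.rand(4, 4).reshape(16, 1)   # n = 16

m1 = 3                                    # neurons in layer 1 (illustrative)
W1 = np.random.rand(m1, 16) - 0.5         # W^[1] has shape m^[1] x n
B1 = np.random.rand(m1, 1) - 0.5          # B^[1] has shape m^[1] x 1

Z1 = W1.dot(X) + B1                       # Z^[1] = W^[1]·X + B^[1]
print(Z1.shape)                           # (3, 1)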
Activation Functions
- ReLU: f(x) = max(0, x)
- Sigmoid: f(x) = 1 / (1 + e^(-x))
- Softplus: f(x) = ln(1 + e^x)
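A minimal NumPy version of these three activations (the function names here are mine; the later slides use sigmoid, sig, and ReLu):

import numpy as np

def relu(x):
    # ReLU: f(x) = max(0, x)
    return np.maximum(0, x)

def sigmoid(x):
    # Sigmoid: f(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # Softplus: f(x) = ln(1 + e^x)
    return np.log(1.0 + np.exp(x))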
Simple Network Architecture
[Diagram: the input X feeds each neuron of the hidden layer (z_1, a_1), (z_2, a_2), (z_3, a_3); their activations feed the second layer (z_1^[2], a_1^[2]), (z_2^[2], a_2^[2]), which produces the output.]
For m neurons in a layer: z_m = w_m · x + b_m
Forward Prop By Hand
1 Hidden Layer & 2 Neurons
Given 1-D input data (plotted on the slide) and the layer 1 parameters; the example uses x = 3.

Layer 1 parameters:
w_1^[1] = 0.5, b_1^[1] = -2
w_2^[1] = 0.9, b_2^[1] = 1

Layer 1 (softplus activation):
z_1^[1] = 3 · 0.5 - 2 = -0.5
a_1^[1] = ln(1 + e^(-0.5)) = 0.474
z_2^[1] = 3 · 0.9 + 1 = 3.7
a_2^[1] = ln(1 + e^(3.7)) = 3.72
Forward Prop By Hand
1 Hidden Layer & 2 Neurons
Carrying forward from Layer 1:
z_1^[1] = -0.5, z_2^[1] = 3.7
a_1^[1] = 0.474, a_2^[1] = 3.72

Layer 2 parameters:
w_1^[2] = [2, -1], b_1^[2] = 0.5

Layer 2:
z_1^[2] = 0.474 · 2 + 3.72 · (-1) + 0.5 = -2.27
ŷ = a(z_1^[2]) = ln(1 + e^(-2.27)) = 0.098
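As a check, the same two-layer pass can be reproduced in a few lines of NumPy (softplus activation and the parameters above; this is just the hand calculation restated):

import numpy as np

def softplus(x):
    return np.log(1.0 + np.exp(x))

x = 3.0
W1, B1 = np.array([0.5, 0.9]), np.array([-2.0, 1.0])   # layer 1 weights and biases
W2, B2 = np.array([2.0, -1.0]), 0.5                     # layer 2 weights and bias

Z1 = W1 * x + B1          # [-0.5, 3.7]
A1 = softplus(Z1)         # [0.474, 3.724]
Z2 = W2.dot(A1) + B2      # -2.27
y_hat = softplus(Z2)      # 0.098
print(Z1, A1, Z2, y_hat)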
Can We Do It with NumPy?
• 100 data points
• RNG noise added (a data-generation sketch follows the architecture below)
Model Architecture
a) 1 Hidden Layer
b) 2 Neurons
c) Sigmoid Activation
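The slides don't give the exact curve behind the 100 noisy points, so this data-generation sketch is an assumption (a noisy sine), just to have something concrete to train on:

import numpy as np

rng = np.random.default_rng(0)

# 100 1-D points with RNG noise (the underlying curve is an assumption, not from the slides)
X = np.linspace(0.0, 3.0, 100)
y = np.sin(X) + 0.1 * rng.normal(size=X.shape)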
Define Hidden Layer Input
Z^[1] = W^[1] · X + B^[1]
1. Training data, weights, and biases as input
2. Return Z

import numpy as np

# Step 1
def hidden_layer_input(x, w1, w2, b1, b2):
    '''Takes a data point x and returns the vector [z_1_i, z_2_i]
    corresponding to the two neurons in the hidden layer:
        z_1_i = x*w1 + b1
        z_2_i = x*w2 + b2
    '''
    z_1_i = x * w1 + b1
    z_2_i = x * w2 + b2
    return np.array([z_1_i, z_2_i])
Define Output Function
def final_layer_output(fx1, fx2, w3, w4, b3):
    '''Passes data from the 2nd layer to the last layer, as shown in Step 3.
    1. Outputs of Layer 1 (fx1, fx2) as input
    '''
    output = fx1 * w3 + fx2 * w4 + b3
    return output
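The forward pass on the next slide also calls a sigmoid() helper that isn't shown in the deck; a minimal version would be:

import numpy as np

def sigmoid(x):
    # hidden-layer activation used by the forward pass
    return 1.0 / (1.0 + np.exp(-x))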
Our Model
Parameters:
w_1^[1] = -1.123      b_1^[1] = 0.167
w_2^[1] = 1.285       b_2^[1] = -1.496
w_1^[2] = [-0.841, 2.693]      b_1^[2] = 0.016

Forward Pass:

# create model predictions to see how the model compares
model_predictions = []
for test in X:
    HL_input_model = hidden_layer_input(test, w_1, w_2, b_1, b_2)
    fx1_model, fx2_model = sigmoid(HL_input_model)
    output_model = final_layer_output(fx1_model, fx2_model, w_3, w_4, b_3)
    model_predictions.append(output_model)
Let's Predict
x = 1, y = 1.09

Parameters: w_1^[1] = -1.123, b_1^[1] = 0.167, w_2^[1] = 1.285, b_2^[1] = -1.496, w_1^[2] = [-0.841, 2.693], b_1^[2] = 0.016

• z_1^[1] = 1 · (-1.123) + 0.167 = -0.956
• a_1^[1] = 1 / (1 + e^(0.956)) = 0.277
• z_2^[1] = 1 · 1.285 - 1.496 = -0.211
• a_2^[1] = 1 / (1 + e^(0.211)) = 0.4474
• ŷ = 0.277 · (-0.841) + 0.4474 · 2.693 + 0.016 = 0.988
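Plugging the same trained parameters into the helper functions defined earlier reproduces this prediction (a quick sanity check, reusing hidden_layer_input, sigmoid, and final_layer_output from the slides above):

w_1, b_1 = -1.123, 0.167
w_2, b_2 = 1.285, -1.496
w_3, w_4, b_3 = -0.841, 2.693, 0.016

z = hidden_layer_input(1.0, w_1, w_2, b_1, b_2)     # [-0.956, -0.211]
fx1, fx2 = sigmoid(z)                               # [0.277, 0.447]
print(final_layer_output(fx1, fx2, w_3, w_4, b_3))  # ≈ 0.988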
Results with NumPy
[Plots: model predictions with random parameters vs. trained parameters]
Back Propagation
Simple Optimization Problem
Define Loss / Gradient Descent
• Loss = Σ_{i=1..n} (ŷ_i - y_i)^2
• What's the derivative of the Loss?
[Diagram: computational graph from the input X through the weights W and activations A to ŷ and the Loss]
Chain Rule!
• How does w_1^[2] affect the Loss? (Assume the output has no activation function.)

∂L/∂w_1^[2] = ?

Loss = ½ (ŷ_i - y_i)^2        →   dL/dŷ_i = ŷ_i - y_i
ŷ = W^[2] · A^[1] + B^[2]     →   dŷ_i/dw_1^[2] = A^[1]

∂L/∂w_1^[2] = (dL/dŷ_i) · (dŷ_i/dw_1^[2]) = (ŷ_i - y_i) · A^[1]

That's just the output of the 2nd-to-last layer!
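As a quick numeric check of this result, using the x = 1, y = 1.09 example from the prediction slide (ŷ ≈ 0.988, a_1^[1] ≈ 0.277); the variable names here are mine:

# ∂L/∂w_1^[2] = (ŷ - y) · a_1^[1]
y, y_hat, a1 = 1.09, 0.988, 0.277
dL_dw3 = (y_hat - y) * a1
print(dL_dw3)    # ≈ -0.028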
A Little Harder Chain Rule!
• How does w_1^[1] affect the Loss? (Assume the output has no activation function; recall that a_i = a(z_i).)

∂L/∂w_1^[1] = ?

1. Loss = ½ (ŷ_i - y_i)^2        →   dL/dŷ_i = ŷ_i - y_i
2. ŷ = W^[2] · A^[1] + B^[2]     →   dŷ_i/dA^[1] = W^[2]
3. dA^[1]/dz_1^[1] = derivative of the activation; for a(x) = 1/(1 + e^(-x)),
   a'(x) = (1/(1 + e^(-x))) · (1 - 1/(1 + e^(-x)))
4. Z^[1] = W^[1] · X + B^[1]     →   dz_1^[1]/dw_1^[1] = X

∂L/∂w_1^[1] = (dL/dŷ_i) · (dŷ_i/dA^[1]) · (dA^[1]/dz_1^[1]) · (dz_1^[1]/dw_1^[1])
∂L/∂w_1^[1] = (ŷ_i - y_i) · W^[2] · a'(x) · X
Notes for Implementation
• Pass all points through the network, then update
• Sum the loss over all data points
In Python
Mapping each factor to code:
• ŷ_i - y_i     →  (output - y_i)
• W^[2]         →  w_3, w_4
• a'(x) = sigmoid(x) · (1 - sigmoid(x)):
    def sig_prime(x):
        return sigmoid(x) * (1 - sigmoid(x))
• X             →  x_i

∂L/∂w_1^[1], accumulated over the data points:
    dL_dw1 += (output - y_i) * w_3 * sig_prime(HL_input[0]) * x_i
Update Parameters
In Python
W^[1] = W^[1] - α · ∂L/∂W^[1]      →  w_1 = w_1 - learning_rate * dL_dw1
W^[2] = W^[2] - α · ∂L/∂W^[2]
B^[1] = B^[1] - α · ∂L/∂B^[1]      →  b_1 = b_1 - learning_rate * dL_db1
B^[2] = B^[2] - α · ∂L/∂B^[2]
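Putting the pieces together, one version of the full training loop (sum the gradients over all points, then update) could look like the sketch below. It reuses X, y and the helpers hidden_layer_input, sigmoid, sig_prime, and final_layer_output from earlier slides; the extra gradient accumulators (dL_dw2, dL_dw4, dL_db1, ...) mirror dL_dw1 and are my own naming, not shown in the deck.

import numpy as np

rng = np.random.default_rng(0)
w_1, w_2, b_1, b_2, w_3, w_4, b_3 = rng.uniform(-1, 1, 7)   # random starting parameters
learning_rate = 0.1

for epoch in range(5000):
    dL_dw1 = dL_dw2 = dL_db1 = dL_db2 = 0.0
    dL_dw3 = dL_dw4 = dL_db3 = 0.0

    for x_i, y_i in zip(X, y):
        # forward pass
        HL_input = hidden_layer_input(x_i, w_1, w_2, b_1, b_2)
        fx1, fx2 = sigmoid(HL_input)
        output = final_layer_output(fx1, fx2, w_3, w_4, b_3)

        # backward pass (chain rule from the previous slides)
        dL_dw3 += (output - y_i) * fx1
        dL_dw4 += (output - y_i) * fx2
        dL_db3 += (output - y_i)
        dL_dw1 += (output - y_i) * w_3 * sig_prime(HL_input[0]) * x_i
        dL_dw2 += (output - y_i) * w_4 * sig_prime(HL_input[1]) * x_i
        dL_db1 += (output - y_i) * w_3 * sig_prime(HL_input[0])
        dL_db2 += (output - y_i) * w_4 * sig_prime(HL_input[1])

    # gradient-descent updates
    w_1 -= learning_rate * dL_dw1
    w_2 -= learning_rate * dL_dw2
    b_1 -= learning_rate * dL_db1
    b_2 -= learning_rate * dL_db2
    w_3 -= learning_rate * dL_dw3
    w_4 -= learning_rate * dL_dw4
    b_3 -= learning_rate * dL_db3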
Update Random Weights
Forward Prop – Same as Before (x = 1)

Random starting parameters:
w_1^[1] = -1, b_1^[1] = 0.4, w_2^[1] = 1, b_2^[1] = -1.25
w_1^[2] = [-0.75, 2.5], b_1^[2] = 0.0

• z_1^[1] = 1 · (-1) + 0.4 = -0.6
• a_1^[1] = 1 / (1 + e^(0.6)) = 0.354
• z_2^[1] = 1 · 1 - 1.25 = -0.25
• a_2^[1] = 1 / (1 + e^(0.25)) = 0.438
• ŷ = 0.354 · (-0.75) + 0.438 · 2.5 + 0.0 = 0.8295
Update Random Weights Pt.2
Forward prop with the data point x_i = 1 (same parameters as before: w_1^[1] = -1, b_2^[1] = -1.25, w_1^[2] = [-0.75, 2.5], b_1^[2] = 0.0).

W^[1]_new = W^[1]_old - α · ∂L/∂W^[1]

Here we update the weights based on one point; in practice we will sum the gradients over all points and then update the weights.

W^[1]_updated = [-1, 1] - 0.1 · [0.044, -0.14] = [-1.044, 1.0]
New Data
• Let’s Use Softplus
• 5000 Epochs
• Still 2 Nodes, 1 HL
• Note: normalize the data for better results (a small scaling sketch follows below)
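The normalization mentioned above could be a simple min-max scaling; a minimal sketch (not necessarily the deck's exact preprocessing), applied to the X and y arrays from the earlier data-generation step:

import numpy as np

def min_max_normalize(v):
    # scale a 1-D array to the range [0, 1]
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min())

X_norm = min_max_normalize(X)
y_norm = min_max_normalize(y)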
Fits the Data Well
Harder Data
Example data: Apple Quality classification (Good = 1, Bad = 0)
Generalized Code
def init_params():
    W1 = np.random.rand(20, 4) - 0.5
    b1 = np.random.rand(20, 1) - 0.5
    W2 = np.random.rand(20, 20) - 0.5
    b2 = np.random.rand(20, 1) - 0.5
    return W1, b1, W2, b2

def forward_prop(W1, b1, W2, b2, X):
    Z1 = W1.dot(X) + b1
    A1 = ReLu(Z1)
    Z2 = W2.dot(A1) + b2
    A2 = sig(Z2)
    return Z1, A1, Z2, A2
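forward_prop() relies on ReLu() and sig() helpers that aren't shown on the slide; a minimal version, plus a call tying the two functions together (the 4-feature input shape follows the init_params() shapes; the batch of 5 samples is illustrative):

import numpy as np

def ReLu(Z):
    return np.maximum(0, Z)

def sig(Z):
    return 1.0 / (1.0 + np.exp(-Z))

X = np.random.rand(4, 5)                          # 5 samples, shaped (features, samples)
W1, b1, W2, b2 = init_params()
Z1, A1, Z2, A2 = forward_prop(W1, b1, W2, b2, X)
print(A2.shape)                                   # (20, 5)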
Original Image (6×6):
1 0 0 0 0 1
1 1 0 0 1 1
1 0 1 1 0 1
1 0 0 0 0 1
1 0 0 0 0 1
1 0 0 0 0 1

Filter (3×3):
1 0 0
1 0 0
1 0 0

Dot Product (4×4):
3 1 1 1
3 1 1 1
3 0 1 1
3 0 0 0
Essence of Convolution
The 4×4 dot-product grid plus a bias (here the bias is -2) gives the layer's output:

3 1 1 1                 1 -1 -1 -1
3 1 1 1   +  Bias  =    1 -1 -1 -1
3 0 1 1                 1 -2 -1 -1
3 0 0 0                 1 -2 -2 -2
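A minimal NumPy sketch of this convolution-plus-bias step, reproducing the 6×6 image, the 3×3 filter, and the bias of -2 from the example (valid convolution, stride 1):

import numpy as np

image = np.array([[1, 0, 0, 0, 0, 1],
                  [1, 1, 0, 0, 1, 1],
                  [1, 0, 1, 1, 0, 1],
                  [1, 0, 0, 0, 0, 1],
                  [1, 0, 0, 0, 0, 1],
                  [1, 0, 0, 0, 0, 1]])
filt = np.array([[1, 0, 0],
                 [1, 0, 0],
                 [1, 0, 0]])
bias = -2

# slide the 3x3 filter over the image (stride 1, no padding) and take dot products
out = np.zeros((4, 4), dtype=int)
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * filt)

print(out)          # the 4x4 "Dot Product" grid from the slide
print(out + bias)   # the grid after adding the bias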