
Introduction to Neural Networks

Mohammad Mundiwala

March 18, 2024
Overview
Part 1: Forward Propagation
a) From input to output by math
b) Implementing this in Python

Part 2: Back Propagation
a) How the network can "learn"
b) Implementing this in Python
Forward Propagation
Network Jargon
• Layer
  i. Input
  ii. Hidden
  iii. Output
• Weight, Bias
• Neuron or Node
  i. Linear function: z = w·x + b
  ii. Activation function: a(z)
Let's Discuss Shapes
Shape of Weights?

Example: a 4×4-pixel image is vectorized into a 16 × 1 input X. A hidden layer of 3 neurons then has W^[1] of shape 3 × 16, and an output layer of 4 neurons has W^[2] of shape 4 × 3.

  Z^[1] = W^[1] · X + B^[1]
  Z^[k] = W^[k] · A^[k-1] + B^[k]

Generalizing the shapes (m^[l] = number of neurons in layer l, n = number of inputs):
  W^[1] : m^[1] × n
  W^[2] : m^[2] × m^[1]
  W^[k] : m^[k] × m^[k-1]
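A quick NumPy sketch (not from the slides) that verifies these shapes, assuming a hidden layer of 3 neurons and an output layer of 4 neurons as in the example above:

  import numpy as np

  # Shape check for the 4x4-pixel example: 16 inputs,
  # a hidden layer of 3 neurons, an output layer of 4 neurons.
  n, m1, m2 = 16, 3, 4

  X  = np.random.rand(n, 1)      # vectorized 4x4 image, shape (16, 1)
  W1 = np.random.rand(m1, n)     # W[1]: m[1] x n    -> (3, 16)
  B1 = np.random.rand(m1, 1)
  W2 = np.random.rand(m2, m1)    # W[2]: m[2] x m[1] -> (4, 3)
  B2 = np.random.rand(m2, 1)

  Z1 = W1 @ X + B1               # Z[1] = W[1] . X + B[1]              -> (3, 1)
  Z2 = W2 @ Z1 + B2              # Z[2] = W[2] . A[1] + B[2] (activation omitted) -> (4, 1)
  print(Z1.shape, Z2.shape)      # (3, 1) (4, 1)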
Activation Functions

ReLU:      f(x) = max(0, x)
Sigmoid:   f(x) = 1 / (1 + e^-x)
Softplus:  f(x) = ln(1 + e^x)
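A minimal NumPy sketch of the three activations listed above (the later slides assume similar helpers exist):

  import numpy as np

  def relu(x):
      # ReLU: f(x) = max(0, x)
      return np.maximum(0, x)

  def sigmoid(x):
      # Sigmoid: f(x) = 1 / (1 + e^-x)
      return 1.0 / (1.0 + np.exp(-x))

  def softplus(x):
      # Softplus: f(x) = ln(1 + e^x)
      return np.log(1.0 + np.exp(x))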
Simple Network Architecture
(Diagram: input X feeding neurons z_1, z_2, z_3 with activations a_1, a_2, a_3, then the output)

For m neurons in a layer:
  z_m = w_m · x + b_m
  a_m = a(z_m)

For Softplus activation:
  a(x) = ln(1 + e^x)
Multiple Layers Notation
Superscripts in brackets denote the layer, e.g. z_i^[1], a_i^[1] for the first hidden layer and z_i^[2], a_i^[2] for the second.
(Diagram: input X → layer 1 nodes z_1^[1]…z_3^[1] with a_1^[1]…a_3^[1] → layer 2 nodes z_1^[2], z_2^[2] with a_1^[2], a_2^[2] → output)
Forward Prop By Hand
1 Hidden Layer & 2 Neurons, 1-D input

Given input data (3, 0.6, 0.8, 1.2; the point x = 3 is used here) and Layer 1 parameters:
  w_1^[1] = 0.5, b_1^[1] = -2
  w_2^[1] = 0.9, b_2^[1] = 1

Layer 1:
  z_1^[1] = 3 · 0.5 - 2 = -0.5
  a_1^[1] = ln(1 + e^-0.5) = 0.474
  z_2^[1] = 3 · 0.9 + 1 = 3.7
  a_2^[1] = ln(1 + e^3.7) = 3.72
Forward Prop By Hand
1 Hidden Layer & 2 Neurons

From Layer 1: z_1^[1] = -0.5, z_2^[1] = 3.7, a_1^[1] = 0.474, a_2^[1] = 3.72

Layer 2 parameters:
  w_1^[2] = [2, -1], b_1^[2] = 0.5

Layer 2:
  z_1^[2] = 0.474 · 2 + 3.72 · (-1) + 0.5 = -2.27
  ŷ_1 = a(z_1^[2]) = ln(1 + e^-2.27) = 0.098
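A short sketch (assuming the softplus helper above) that reproduces this hand calculation:

  import numpy as np

  def softplus(z):
      return np.log(1.0 + np.exp(z))

  x = 3.0
  w1, b1 = 0.5, -2.0                           # neuron 1, layer 1
  w2, b2 = 0.9,  1.0                           # neuron 2, layer 1
  w_out, b_out = np.array([2.0, -1.0]), 0.5    # layer 2

  a = softplus(np.array([x * w1 + b1, x * w2 + b2]))   # [0.474, 3.724]
  y_hat = softplus(a @ w_out + b_out)                  # ln(1 + e^-2.27) ≈ 0.098
  print(a, y_hat)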
Can We Do it with NumPy?
Data: 100 points with RNG noise

Model Architecture:
a) 1 Hidden Layer
b) 2 Neurons
c) Sigmoid Activation
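The slides only state "100 points" with "RNG noise"; the generating function is not shown. A hypothetical data-generation sketch (the quadratic target is only an illustrative guess, loosely consistent with the later sample point x = 1, y ≈ 1.09) might be:

  import numpy as np

  rng = np.random.default_rng(0)                  # reproducible RNG noise
  X = np.linspace(0, 1, 100)                      # 100 input points
  y = X**2 + 0.1 * rng.standard_normal(X.size)    # assumed target function + noise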
Define Hidden Layer Input
1. Training data, weights, bias as input
2. Return Z

Implements Z^[1] = W^[1] · X + B^[1]:

  import numpy as np

  # Step 1
  def hidden_layer_input(x, w1, w2, b1, b2):
      '''This function takes in the data point x and creates a vector
      [z_1_i, z_2_i] corresponding to the neurons in the hidden layer:
          z_1_i = x*w1 + b1
          z_2_i = x*w2 + b2
      '''
      z_1_i = x * w1 + b1
      z_2_i = x * w2 + b2
      return np.array([z_1_i, z_2_i])
Define Output Function
1. Takes the outputs of Layer 1
2. Weights, bias of the final layer
3. No activation function used for the output

Implements ŷ = Z^[k] = W^[k] · A^[k-1] + B^[k]:

  def final_layer_output(fx1, fx2, w3, w4, b3):
      '''This will pass data from the 2nd layer to the last layer.
      x is in the form [f(x1), f(x2)] and will need to be
      treated accordingly.'''
      output = fx1*w3 + fx2*w4 + b3
      return output
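The forward pass on the next slide also calls a sigmoid helper that is not shown in the extracted slides; a minimal definition consistent with that usage would be:

  import numpy as np

  def sigmoid(z):
      # Element-wise sigmoid, applied to the hidden-layer vector Z[1]
      return 1.0 / (1.0 + np.exp(-z))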
Our Model
Parameters:
  w_1^[1] = -1.123    b_1^[1] = 0.167
  w_2^[1] = 1.285     b_2^[1] = -1.496
  w_1^[2] = [-0.841, 2.693]    b_1^[2] = 0.016

Forward Pass:

  # create model predictions to see how the model compares
  model_predictions = []
  for test in X:
      HL_input_model = hidden_layer_input(test, w_1, w_2, b_1, b_2)
      fx1_model, fx2_model = sigmoid(HL_input_model)
      output_model = final_layer_output(fx1_model, fx2_model, w_3, w_4, b_3)
      model_predictions.append(output_model)
Let's Predict
Data point: x = 1, y = 1.09

Parameters:
  w_1^[1] = -1.123, b_1^[1] = 0.167
  w_2^[1] = 1.285, b_2^[1] = -1.496
  w_1^[2] = [-0.841, 2.693], b_1^[2] = 0.016

• z_1^[1] = 1 · (-1.123) + 0.167 = -0.956
• a_1^[1] = 1 / (1 + e^0.956) = 0.277
• z_2^[1] = 1 · 1.285 - 1.496 = -0.211
• a_2^[1] = 1 / (1 + e^0.211) = 0.4474
• ŷ = 0.277 · (-0.841) + 0.4474 · 2.693 + 0.016 = 0.988
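A quick check of this prediction, reusing the hidden_layer_input, sigmoid, and final_layer_output functions defined above (a sketch; small differences from the slide come from rounding):

  w_1, b_1 = -1.123, 0.167
  w_2, b_2 = 1.285, -1.496
  w_3, w_4, b_3 = -0.841, 2.693, 0.016

  z = hidden_layer_input(1.0, w_1, w_2, b_1, b_2)    # [-0.956, -0.211]
  a1, a2 = sigmoid(z)                                # [0.277, 0.447]
  print(final_layer_output(a1, a2, w_3, w_4, b_3))   # ≈ 0.987 (0.988 on the slide)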
Results with NumPy
(Plots: model predictions with random parameters vs. trained parameters)
Back Propagation
Simple Optimization Problem

Define Loss:
  Loss = Σ_{i=1..n} (ŷ_i - y_i)²

Gradient Descent:
• What's the derivative of the Loss?

(Diagram: computational graph from input X through the weights W, activations A, and ŷ to the Loss)
Chain Rule!
• How does w_1^[2] affect the Loss? (assume the output has no activation function)

  Loss = 1/2 (ŷ_i - y_i)²              →  dL/dŷ_i = ŷ_i - y_i
  ŷ = W^[2] · A^[1] + B^[2]            →  dŷ_i/dw_1^[2] = A^[1]

  ∂L/∂w_1^[2] = dL/dŷ_i · dŷ_i/dw_1^[2] = (ŷ_i - y_i) · A^[1]

That's just the output of the 2nd-to-last layer!
A Little Harder Chain Rule!
• How does w_1^[1] affect the Loss? (assume the output has no activation function)

  Loss = 1/2 (ŷ_i - y_i)²              (recall that a_i = a(z_i))

  ∂L/∂w_1^[1] = dL/dŷ_i · dŷ_i/dA^[1] · dA^[1]/dz_1^[1] · dz_1^[1]/dw_1^[1]

  1) dL/dŷ_i = ŷ_i - y_i
  2) dŷ_i/dA^[1] = W^[2]               since ŷ = W^[2] · A^[1] + B^[2]
  3) dA^[1]/dz_1^[1] = a'(z)           the derivative of the activation;
         for the sigmoid a(x) = 1/(1 + e^-x),
         a'(x) = 1/(1 + e^-x) · (1 - 1/(1 + e^-x))
  4) dz_1^[1]/dw_1^[1] = X             since Z^[1] = W^[1] · X + B^[1]

  ∂L/∂w_1^[1] = (ŷ_i - y_i) · W^[2] · a'(z_1^[1]) · X
Notes for
Implementation
• Pass all points and then update
• Sum all data points’ loss
In Python
Mapping each term to code:

  ŷ_i - y_i                            →  (output - y_i)
  W^[2]                                →  w_3, w_4
  a'(x) = 1/(1+e^-x) · (1 - 1/(1+e^-x)):
      def sig_prime(x):
          return sigmoid(x)*(1-sigmoid(x))
  X                                    →  x_i

∂L/∂w_1^[1]:
  dL_dw1 += (output - y_i) * w_3 * sig_prime(HL_input[0]) * x_i
Update Parameters
In Python:

  W^[1] = W^[1] - α · ∂L/∂w^[1]        →  w_1 = w_1 - learning_rate * dL_dw1
  W^[2] = W^[2] - α · ∂L/∂w^[2]
  B^[1] = B^[1] - α · ∂L/∂b^[1]        →  b_1 = b_1 - learning_rate * dL_db1
  B^[2] = B^[2] - α · ∂L/∂b^[2]
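Putting the gradients and updates together, a minimal full-batch training loop might look like the sketch below. Assumptions: the helpers hidden_layer_input, sigmoid, sig_prime, and final_layer_output defined earlier, the scalar parameters w_1 … b_3, the training arrays X, y, and the squared-error loss. Only dL_dw1 is worked out explicitly on the slides; the remaining gradients follow the same chain-rule pattern.

  learning_rate = 0.1

  for epoch in range(5000):            # e.g. 5000 epochs, as used later in the slides
      # accumulate gradients over all data points, then update once
      dL_dw1 = dL_db1 = dL_dw2 = dL_db2 = 0.0
      dL_dw3 = dL_dw4 = dL_db3 = 0.0

      for x_i, y_i in zip(X, y):
          HL_input = hidden_layer_input(x_i, w_1, w_2, b_1, b_2)
          fx1, fx2 = sigmoid(HL_input)
          output = final_layer_output(fx1, fx2, w_3, w_4, b_3)
          err = output - y_i                   # dL/dy_hat for 0.5*(y_hat - y)^2

          # hidden-layer weights/biases (chain rule, as derived above)
          dL_dw1 += err * w_3 * sig_prime(HL_input[0]) * x_i
          dL_db1 += err * w_3 * sig_prime(HL_input[0])
          dL_dw2 += err * w_4 * sig_prime(HL_input[1]) * x_i
          dL_db2 += err * w_4 * sig_prime(HL_input[1])

          # output-layer weights/bias (gradient is just the hidden activations)
          dL_dw3 += err * fx1
          dL_dw4 += err * fx2
          dL_db3 += err

      w_1 -= learning_rate * dL_dw1
      b_1 -= learning_rate * dL_db1
      w_2 -= learning_rate * dL_dw2
      b_2 -= learning_rate * dL_db2
      w_3 -= learning_rate * dL_dw3
      w_4 -= learning_rate * dL_dw4
      b_3 -= learning_rate * dL_db3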
Update Random Weights
Forward Prop – Same as Before

Random initial parameters:
  w_1^[1] = -1      b_1^[1] = 0.4
  w_2^[1] = 1       b_2^[1] = -1.25
  w_1^[2] = [-0.75, 2.5]    b_1^[2] = 0.0

• z_1^[1] = 1 · (-1) + 0.4 = -0.6
• a_1^[1] = 1 / (1 + e^0.6) = 0.354
• z_2^[1] = 1 · 1 - 1.25 = -0.25
• a_2^[1] = 1 / (1 + e^0.25) = 0.438
• ŷ = 0.354 · (-0.75) + 0.438 · 2.5 + 0.0 = 0.8295
Update Random Weights Pt. 2

Parameters: w_1^[1] = -1, b_1^[1] = 0.4, w_2^[1] = 1, b_2^[1] = -1.25,
            W^[2] = [-0.75, 2.5], b_1^[2] = 0.0

Forward prop results for the data point x_i = 1, y_i = 1.09:
• z_1^[1] = -0.6,  a_1^[1] = 0.354
• z_2^[1] = -0.25, a_2^[1] = 0.438
• output ŷ = 0.8295
• ŷ_i - y_i = 0.8295 - 1.09 = -0.2605
• a'(z_1^[1]) = (0.354)(1 - 0.354) = 0.228
• a'(z_2^[1]) = (0.438)(1 - 0.438) = 0.246
• X = 1

∂L/∂W^[1] = (ŷ_i - y_i) · W^[2] ⊙ a'(Z^[1]) · X
          = -0.2605 · [(-0.75)(0.228), (2.5)(0.246)] · 1
          = [0.045, -0.160]
Update Random Weights Pt. 2 (cont.)

∂L/∂W^[1] = [0.045, -0.160]

W_new^[1] = W_old^[1] - α · ∂L/∂W^[1]

W_updated^[1] = [-1, 1] - 0.1 · [0.045, -0.160] = [-1.004, 1.016]

In this example we updated the weights based on one data point. In practice we will sum the gradients over all data points and then update the weights.
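A short numeric check of this single-point gradient and update (a sketch; small differences from the rounded slide values are expected):

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  x_i, y_i = 1.0, 1.09
  W1, B1 = np.array([-1.0, 1.0]), np.array([0.4, -1.25])   # hidden layer
  W2, b2 = np.array([-0.75, 2.5]), 0.0                     # output layer

  Z1 = W1 * x_i + B1                  # [-0.6, -0.25]
  A1 = sigmoid(Z1)                    # [0.354, 0.438]
  y_hat = A1 @ W2 + b2                # ≈ 0.829

  dL_dW1 = (y_hat - y_i) * W2 * A1 * (1 - A1) * x_i   # ≈ [ 0.045, -0.161]
  print(W1 - 0.1 * dL_dW1)                            # ≈ [-1.004,  1.016]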
New Data
• Let's use Softplus
• 5000 epochs
• Still 2 nodes, 1 hidden layer
• Note: Normalize the data for better results

Fits the Data Well
(Plot: trained model predictions against the new data)
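The slides recommend normalizing the data; the exact scheme is not shown, but a typical min-max scaling sketch would be:

  import numpy as np

  def min_max_normalize(a):
      # Scale values to the [0, 1] range, column-wise
      a = np.asarray(a, dtype=float)
      return (a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))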
Harder Data
Example data: Apple Quality (data from an American agricultural company)

Features:
  • Weight
  • Sweetness
  • Crunchiness
  • Acidity

Example feature matrix (4 features × n apples):
  [-2.512   ...  -1.3512]
  [ 5.346   ...  -1.738 ]
  [-1.012   ...  -0.3426]
  [-0.4915  ...   2.6216]

Classification labels: Good = 1, Bad = 0, stored as a column vector [1, ..., 0]
Generalized Code

  def init_params():
      W1 = np.random.rand(20, 4) - 0.5
      b1 = np.random.rand(20, 1) - 0.5
      W2 = np.random.rand(20, 20) - 0.5
      b2 = np.random.rand(20, 1) - 0.5
      return W1, b1, W2, b2

  def forward_prop(W1, b1, W2, b2, X):
      Z1 = W1.dot(X) + b1
      A1 = ReLu(Z1)
      Z2 = W2.dot(A1) + b2
      A2 = sig(Z2)
      return Z1, A1, Z2, A2

  def back_prop(Z1, A1, Z2, A2, W2, X, Y):
      m = Y.size
      dZ2 = (A2 - Y)
      dW2 = 1/m * dZ2.dot(A1.T)
      db2 = 1/m * np.sum(dZ2)
      dZ1 = W2.T.dot(dZ2)*ReLu_prime(Z1)
      dW1 = 1/m * dZ1.dot(X.T)
      db1 = 1/m * np.sum(dZ1)
      return dW1, db1, dW2, db2

  def update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha):
      W1 = W1 - alpha*dW1
      b1 = b1 - alpha*db1
      W2 = W2 - alpha*dW2
      b2 = b2 - alpha*db2
      return W1, b1, W2, b2
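The generalized code above assumes ReLu, ReLu_prime, and sig helpers and a driver loop that are not shown on this slide; a minimal sketch consistent with it might be:

  import numpy as np

  def ReLu(Z):
      return np.maximum(0, Z)

  def ReLu_prime(Z):
      return (Z > 0).astype(float)

  def sig(Z):
      return 1.0 / (1.0 + np.exp(-Z))

  def gradient_descent(X, Y, alpha, iterations):
      # X: (4, m) feature matrix, Y: (1, m) labels
      W1, b1, W2, b2 = init_params()
      for _ in range(iterations):
          Z1, A1, Z2, A2 = forward_prop(W1, b1, W2, b2, X)
          dW1, db1, dW2, db2 = back_prop(Z1, A1, Z2, A2, W2, X, Y)
          W1, b1, W2, b2 = update_params(W1, b1, W2, b2,
                                         dW1, db1, dW2, db2, alpha)
      return W1, b1, W2, b2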


Convolutional NN
Image classification seems expensive…

1. Reduce the number of input nodes
2. Tolerate small 'shifts' or changes in an image
3. Take advantage of correlations seen in the image

(Example: a 6 × 6 pixel image)
Essence of Convolution

Original image (6 × 6):
  1 0 0 0 0 1
  1 1 0 0 1 1
  1 0 1 1 0 1
  1 0 0 0 0 1
  1 0 0 0 0 1
  1 0 0 0 0 1

Filter (3 × 3):
  1 0 0
  1 0 0
  1 0 0

Dot product (4 × 4 feature map):
  3 1 1 1
  3 1 1 1
  3 0 1 1
  3 0 0 0
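A small sketch that reproduces this dot product (a valid cross-correlation, stride 1, no padding) in NumPy:

  import numpy as np

  image = np.array([[1,0,0,0,0,1],
                    [1,1,0,0,1,1],
                    [1,0,1,1,0,1],
                    [1,0,0,0,0,1],
                    [1,0,0,0,0,1],
                    [1,0,0,0,0,1]])

  kernel = np.array([[1,0,0],
                     [1,0,0],
                     [1,0,0]])

  # Slide the 3x3 filter over the 6x6 image (no padding, stride 1)
  out = np.zeros((4, 4), dtype=int)
  for i in range(4):
      for j in range(4):
          out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)
  print(out)   # matches the 4x4 feature map above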
Essence of Convolution

  3 1 1 1                   1 -1 -1 -1
  3 1 1 1   + Bias (-2) =   1 -1 -1 -1
  3 0 1 1                   1 -2 -1 -1
  3 0 0 0                   1 -2 -2 -2

- Pooling is used to further condense the data
- Goal: Train a model on a summary of the image
