Understanding and Creating Neural Networks
In this and some following blog posts, I will consolidate all that I have
learned, as a way to give back to the community and help new entrants.
I will be creating common forms of neural networks all with the help of
nothing but NumPy.
This blog post is divided into two parts: the first part covers the basics of
a neural network, and the second part comprises the code for implementing
everything learned from the first part.
. . .
Let’s dig in
Neural networks are a model inspired by how the brain works. Similar
to neurons in the brain, our ‘mathematical neurons’ are also,
intuitively, connected to each other; they take inputs (dendrites), do
some simple computation on them, and produce outputs (axons).
The best way to learn something is to build it. Let’s start with a simple
neural network and hand-solve it. This will give us an idea of how the
computations flow through a neural network.
Fig.1 Simple input-output only neural network
As in the figure above, most of the time you will see a neural network
depicted in a similar way. But this succinct and simple-looking picture
hides a bit of the complexity. Let’s expand it out.
Now, let’s go over each node in our graph and see what it represents.
. . .
These nodes represent our inputs for our first and second features, x₁
and x₂, which define a single example we feed to the neural network;
hence the name “Input Layer”.
Fig 4. Weights
Weights are the main values our neural network has to “learn”. So
initially, we will set them to random values and let the “learning
algorithm” of our neural network decide the best weights that result in
the correct outputs.
This node represents a linear function. Simply, it takes all the inputs
coming to it and creates a linear equation/combination out of them. (By
convention, it is understood that a linear combination of weights and
inputs is part of each node, except for the input nodes in the input layer,
so this node is often omitted in figures, as in Fig.1. In this example, I’ll
leave it in.)
This σ node takes the input and passes it through the following
function, called the sigmoid function (because of its S-shaped curve),
also known as the logistic function:
Fig 7. Sigmoid(Logistic) function
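Fig 7 is an image in the original post; for reference, a minimal NumPy sketch of the sigmoid, σ(z) = 1/(1 + e⁻ᶻ), looks like this:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))    # 0.5, the midpoint of the S-shaped curve
print(sigmoid(10))   # ~1.0, large inputs saturate towards 1
print(sigmoid(-10))  # ~0.0, very negative inputs saturate towards 0
```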
. . .
Now that we know what everything represents, let’s flex our muscles by
computing each node by hand on some dummy data.
Fig 8. OR gate
We’ll shortly see how our simple neural network performs this task.
Let’s go through all the computations our neural network will perform
given the first example, x₁=0 and x₂=0. Also, we’ll initialize the
weights to w₁=0.1 and w₂=0.6 (recall, these weights have been
randomly selected).
Fig 10. Forward propagation of the first example from OR table data
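If you’d like to verify the hand computation in Fig 10 yourself, here is a small sketch of the same forward pass:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x1, x2 = 0, 0        # first example from the OR gate table
w1, w2 = 0.1, 0.6    # randomly selected initial weights

z = w1 * x1 + w2 * x2   # linear node: 0.1*0 + 0.6*0 = 0
y_hat = sigmoid(z)      # sigmoid node: sigmoid(0) = 0.5
print(y_hat)            # 0.5
```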
With our current weights, w₁=0.1 and w₂=0.6, our network’s
output is a bit far from where we’d like it to be. The predicted output,
ŷ, should be ŷ≈0 for x₁=0 and x₂=0; right now it’s ŷ=0.5.
So, how does one tell a neural network how far it is from our desired
output? In comes the Loss Function to the rescue.
Loss Function
The Loss Function is a simple equation that tells us how far our neural
network’s predicted output (ŷ) is from our desired output (y), for ONE
example only.
. . .
Now that we know the purpose of a Loss function, let’s calculate the
error in our current prediction ŷ=0.5, given y=0.
As we can see, the Loss is 0.125. Given this, we can now use the
derivative of the Loss function to check whether we need to increase or
decrease our weights.
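The value 0.125 is consistent with the squared-error loss used throughout this post, L = ½(ŷ − y)², whose derivative with respect to ŷ is simply (ŷ − y); a quick check:

```python
y, y_hat = 0, 0.5

loss = 0.5 * (y_hat - y) ** 2   # L = 1/2 * (y_hat - y)^2
dL_dy_hat = y_hat - y           # dL/dy_hat

print(loss)       # 0.125
print(dL_dy_hat)  # 0.5 (positive, so decreasing y_hat decreases the Loss)
```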
. . .
To perform backpropagation we’ll employ the following technique: at
each node, we only have the local gradient computed (the partial
derivatives of that node); then, during backpropagation, as we receive
numerical values of gradients from upstream, we multiply them with the
local gradients and pass them on to their respective connected nodes.
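As a concrete illustration of that technique (with made-up numbers): a node that receives an upstream gradient of 0.5, and whose local gradient with respect to one of its inputs is 0.25, passes 0.5 × 0.25 = 0.125 back along that input:

```python
upstream_grad = 0.5    # numerical gradient arriving from the node downstream of us
local_grad = 0.25      # partial derivative of this node w.r.t. one of its inputs
grad_to_pass_on = upstream_grad * local_grad  # chain rule: 0.125 flows backwards
```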
. . .
Since the backpropagation steps can seem a bit complicated, I’ll go over
them step by step:
Fig 17.a. Backpropagation
. . .
For the next calculation, we’ll need the derivative of the sigmoid
function, since it forms the local gradient of the red node. Let’s derive
that.
Fig18. The derivative of the Sigmoid function.
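The derivation in Fig 18 arrives at the well-known identity σ′(z) = σ(z)(1 − σ(z)); in code:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1 - s)

print(sigmoid_derivative(0))  # 0.25, the sigmoid's steepest slope, at z = 0
```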
. . .
Notice something weird? The derivatives of the Loss with respect to the
weights, w₁ & w₂, are ZERO! We can’t increase or decrease the weights
if their derivatives are zero. So then, how do we get our desired output
in this instance if we can’t figure out how to adjust the weights? The
key thing to note here is that the local gradients (∂z/∂w₁ and ∂z/∂w₂)
are x₁ and x₂, both of which, in this example, happen to be zero (i.e.
provide no information).
. . .
Bias
Recall the equation of a line, y = mx + b, from your high school days.
Fig 19. Equation of a Line
Here b is the bias term. Intuitively, the bias tells us that all outputs
computed with x (the independent variable) should have an additive bias of
b. So, when x=0 (no information coming from the independent
variable) the output should be biased to just b.
Note that without the bias term a line can only pass through the origin
(0, 0), and the only differentiating factor between lines would then be
the gradient m.
. . .
So, using this new information, let’s add another node to our neural
network: the bias node. (In neural network literature, every layer except
the input layer is assumed to have a bias node, just like the linear node,
so this node is also often omitted in figures.)
Fig 22. Forward propagation of the first example from OR table data with a bias unit
Well, the forward propagation with a bias of b=0 didn’t change our
output at all, but let’s do the backward propagation before we make
our final judgment.
Hurrah! We just figured out how much to adjust the bias. Since the
derivative of the bias (∂L/∂b) is a positive 0.125, we will need to adjust
the bias by moving in the negative direction of the gradient (recall the
curve of the Loss function from before). This is technically called
gradient descent, as we are “descending” away from the sloping
region to a flat region using the direction of the gradient. Let’s do that.
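The update itself is one line of arithmetic; with a step size of 1 (more on that shortly):

```python
b = 0.0          # current bias
dL_db = 0.125    # gradient of the Loss w.r.t. the bias, from backpropagation

b = b - 1 * dL_db   # move one step in the direction opposite to the gradient
print(b)            # -0.125
```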
Now that we’ve slightly adjusted the bias to b=-0.125, let’s test whether
we’ve done the right thing by doing a forward propagation and
checking the new Loss.
Fig 25. Forward propagation with newly calculated bias
Now you may be wondering: this is only a small improvement from the
previous result, so how do we get to the minimum Loss? Two things
come into play: a) how many iterations of ‘training’ we perform
(each training cycle being a forward propagation followed by a backward
propagation and an update of the weights through gradient descent),
and b) the learning rate.
. . .
Learning Rate
Recall how we calculated the new bias, above, by moving in the
direction opposite to the gradient (i.e. gradient descent).
Fig 27. The equation for updating bias
Notice that when we updated the bias we moved 1 step in the opposite
direction of the gradient.
The learning rate defines how quickly we reach the minimum loss. Let’s
visualize below what the learning rate is doing:
So what’s the takeaway? Just set the learning rate as high as possible and
reach the optimum loss quickly? NO. The learning rate can be a double-
edged sword. Too high a learning rate and the parameters
(weights/biases) don’t reach the optimum and instead start to diverge
away from it. Too small a learning rate and the parameters take too
long to converge to the optimum.
Fig 31. Visualizing the effect of very low vs. very high learning rate.
In short, the goal is not to find the “perfect learning rate” but instead
a learning rate large enough that the neural network trains
successfully and efficiently without diverging.
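A tiny illustration of this double-edged behavior, using the simple quadratic L(b) = b² (whose minimum is at b = 0) as a stand-in for our U-shaped Loss curve:

```python
def descend(b, alpha, steps=5):
    # Gradient descent on L(b) = b^2; the gradient is 2*b
    for _ in range(steps):
        b = b - alpha * 2 * b
    return b

print(descend(b=1.0, alpha=0.1))  # ~0.328: a small alpha converges, but slowly
print(descend(b=1.0, alpha=0.5))  # 0.0: a well-chosen alpha converges quickly
print(descend(b=1.0, alpha=1.1))  # ~-2.49: too large, each step overshoots and diverges
```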
. . .
So far we’ve only used one example (x₁=0 and x₂=0) to adjust our
weights and bias (actually, only our bias up till now), and that
reduced the loss on one example from our entire dataset (the OR gate
table). But we have more than one example to learn from and we want
to reduce the loss across all of them. Ideally, in one training
iteration, we would like to reduce the loss across all the training
examples. This is called Batch Gradient Descent (or full-batch
gradient descent), as we use the entire batch of training examples per
training iteration to improve our weights and biases. (Other forms are
mini-batch gradient descent, where we use a subset of the dataset in
each iteration, and stochastic gradient descent, where we use only one
example per training iteration, as we’ve done so far.)
A training iteration where the neural network goes through all the
training examples is called an Epoch. If using mini-batches, then an epoch
is complete after the neural network goes through all the mini-batches;
similarly for stochastic gradient descent, where a batch is just one
example.
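A sketch of how the three variants differ within one epoch (the function and variable names here are illustrative, not from the original code):

```python
def one_epoch(X, Y, params, update_step, batch_size):
    # One pass over all m training examples, consumed in chunks of batch_size
    m = X.shape[1]  # examples stored column-wise, as in this blog
    for start in range(0, m, batch_size):
        X_batch = X[:, start:start + batch_size]
        Y_batch = Y[:, start:start + batch_size]
        params = update_step(params, X_batch, Y_batch)  # forward + backward + descent
    return params

# batch_size == m -> batch gradient descent (one update per epoch)
# batch_size < m  -> mini-batch gradient descent
# batch_size == 1 -> stochastic gradient descent
```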
. . .
Cost Function
When we perform “batch gradient descent” we need to slightly change
our Loss function to accommodate not just one example but all the
examples in the batch. This adjusted Loss function is called the Cost
Function.
Also, note that the curve of the Cost Function is similar to the curve of the
Loss function (the same U-shape).
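Concretely, since the Cost is the average of the per-example Losses (as the calculations below confirm), it can be written as J = (1/m) Σᵢ ½(ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)²; in NumPy:

```python
import numpy as np

def compute_cost(Y_hat, Y):
    # Average of the squared-error Losses over all m examples (stored column-wise)
    m = Y.shape[1]
    return np.sum(0.5 * (Y_hat - Y) ** 2) / m
```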
Intuitively, the Cost function expands the capability of the Loss
function. Recall how the Loss function was helping to minimize the
vertical distance between a single data point and the predictor line (z).
The Cost function helps to minimize the vertical distance
(Squared Error Loss) between multiple data points, concurrently.
During batch gradient descent we’ll use the derivative of the Cost
function, instead of the Loss function, to guide our path to minimum
cost across all examples. (In some neural network literature, the Cost
Function is at times also represented with the letter ‘J’.)
Let’s take a look at how the derivative equation of the Cost function
differs from the plain derivative of the Loss function.
The derivative of the Cost Function
Nothing new here in the calculation of the Cost. Just as expected, the
Cost, in the end, is the average of the Loss, but the implementation is
now vectorized (we performed a vectorized subtraction followed by
element-wise exponentiation, called Hadamard exponentiation). Let’s
derive the partial derivatives.
Fig 36. Calculation of Jacobian on the simple example
Fig 38. Comparison between the partial derivative of Loss and Cost with respect to(w.r.t) ŷ⁽ⁱ⁾
We’ll later see how this small change manifests itself in the calculation
of the gradient.
. . .
Let’s set up our data (X, W, b & Y) before doing forward and backward
propagation.
Fig 41. Setup data for vectorized computations.
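A sketch of that setup, with examples stored column-wise (the dimension analysis later confirms Xₜᵣₐᵢₙ is 2 × 4) and the weights carried over from before; the bias is assumed here to start at 0, which reproduces the Cost of 0.089 quoted below:

```python
import numpy as np

# OR gate dataset: each column is one example (x1, x2)
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]])
Y = np.array([[0, 1, 1, 1]])   # desired outputs, one per column

W = np.array([[0.1, 0.6]])     # (1 x 2) weight matrix, same weights as before
b = np.zeros((1, 1))           # (1 x 1) bias
```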
(NOTE: all the results below are rounded to 3 decimal places, just for
brevity)
Fig 42. Vectorized Forward Propagation on OR gate dataset
How cool is that? We calculated all the forward propagation steps for all
the examples in our dataset in one go, just by vectorizing our
computations.
Our Cost with our current weights, W, turns out to be 0.089. Our goal
now is to reduce this cost using backpropagation and gradient descent.
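The whole vectorized forward pass, plus the Cost, is only a few lines (continuing from the setup above):

```python
Z = np.dot(W, X) + b            # (1 x 2)·(2 x 4) + (1 x 1) broadcast -> (1 x 4)
Y_hat = 1 / (1 + np.exp(-Z))    # sigmoid applied element-wise to all 4 examples

m = X.shape[1]
cost = np.sum(0.5 * (Y_hat - Y) ** 2) / m
print(round(cost, 3))           # 0.089
```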
As before, we’ll go through backpropagation in a step-by-step manner.
Fig 44.a. Vectorized Backward on OR gate data
Fig 44.b. Vectorized Backward on OR gate data
Fig 44.c. Vectorized Backward on OR gate data
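The three figures above compress into a few vectorized lines; a sketch, continuing from the forward pass above (note how each step is the chain rule applied to whole matrices at once):

```python
dY_hat = (Y_hat - Y) / m                # derivative of the Cost w.r.t. the predictions
dZ = dY_hat * Y_hat * (1 - Y_hat)       # chain through the sigmoid's local gradient
dW = np.dot(dZ, X.T)                    # (1 x 4)·(4 x 2) -> (1 x 2), same shape as W
db = np.sum(dZ, axis=1, keepdims=True)  # sum over the 4 examples -> (1 x 1), same as b
```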
Let’s update the weights and bias, keeping the learning rate the same as in
the non-vectorized implementation from before, i.e. α=1.
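In code, that update is:

```python
alpha = 1.0           # learning rate, same as before
W = W - alpha * dW
b = b - alpha * db
```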
Now that we have updated the weights and bias, let’s do a forward
propagation and calculate the new Cost to check whether we’ve done the
right thing.
Fig 46. Vectorized Forward Propagation with updated weights and bias
Fig 48. Cost curve and Decision boundary after 5000 epochs
The Cost curve is simply the value of the Cost plotted after a certain
number of iterations (epochs). Notice that the Cost curve flattens after
about 3000 epochs; this means that the weights and bias of the neural
network have converged, so further training will only slightly improve
them. Why? Recall the U-shaped Loss curve: as we descend closer and
closer to the minimum point (the flat region), the gradients become
smaller and smaller, so the steps gradient descent takes are very small.
The Decision Boundary shows the line along which the decision of
the neural network changes from one output to the other. We can
better visualize this by coloring the areas below and above the decision
boundary.
Fig 49. Decision boundary visualized after 5000 epochs
This makes it much clearer. The red-shaded area is the area below the
decision boundary, and everything below the decision boundary has an
output (ŷ) of 0. Similarly, everything above the decision boundary,
shaded green, has an output of 1. In conclusion, our simple neural
network has learned a decision boundary by looking at the training
data and figuring out how to separate its two output classes (y=1 and
y=0). Now the output neuron fires up (produces 1) whenever x₁
or x₂ or both are 1.
Now would be a good time to see how the “1/m” (“m” is the total
number of examples in the training dataset) in the Cost function
manifested in the final calculation of the gradients.
Fig 50. Comparing the effect of derivative w.r.t Cost and Loss on parameters of the neural network
From this, the most important point to know is that the gradient
used to update our weights, when using the Cost function, is the
average of all the gradients calculated during a training iteration;
the same applies to the bias. You may want to confirm this by
checking the vectorized calculations yourself.
Taking the average of all the gradients has some benefits. First, it
gives us a less noisy estimate of the gradient. Second, the resulting
learning curve is smooth, helping us easily determine whether the neural
network is learning or not. Both of these features come in very handy
when training neural networks on much trickier datasets, such as those
with wrongly labeled examples.
. . .
From this example, we can generalize the following rule: Sum all the
incoming gradients to a node, from all the possible paths.
Let’s visualize how this rule is used in the calculation of the bias. During
a training iteration, our neural network can be seen as performing
independent calculations for each of our examples while using shared
parameters for the weights and bias. Below, the bias (b) is visualized as a
parameter shared across all the individual calculations our neural network
performs.
Fig 54. Visualizing bias parameter being shared across a training epoch.
Following the general rule defined above, we will sum all the incoming
gradients from all the possible paths to the bias node, b.
Fig 55. Visualizing all possible backpropagation paths to shared bias parameter
Since ∂Z/∂b (the local gradient at the Z node) is equal to 1, the total
gradient at b is the sum of the gradients from each example with respect to
the Cost.
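In NumPy, this “sum over all paths” is exactly a reduction over the example axis; a sketch with made-up gradient values:

```python
import numpy as np

dZ = np.array([[0.1, -0.2, 0.05, 0.3]])  # hypothetical upstream gradients, one per example
db = np.sum(dZ, axis=1, keepdims=True)   # local gradient dZ/db = 1, so just sum them
print(db)                                # [[0.25]], a (1 x 1) matrix like b itself
```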
Now that we’ve got the derivative of the bias figured out, let’s move on to
the derivative of the weights, and more importantly the local gradient with
respect to the weights.
Again, following the general rule defined above, we will sum all the
incoming gradients from all the possible paths to the weights node, W.
In our OR gate example we know that the gradient flowing into node Z
is a (1 × 4) matrix, Xₜᵣₐᵢₙ is a (2 × 4) matrix, and the derivative of the Cost
with respect to W needs to be of the same size as W, which is (1 ×
2). So, the only way to generate a (1 × 2) matrix is to take the
dot product between the gradient at Z and the transpose of Xₜᵣₐᵢₙ.
Similarly, knowing that the bias, b, is a simple (1 × 1) matrix and the
gradient flowing into node Z is (1 × 4), using dimension analysis we
can be sure that the gradient of the Cost w.r.t. b also needs to be a (1 × 1)
matrix. The only way we can achieve this, given that the local gradient
(∂Z/∂b) is just 1, is by summing up the upstream gradient.
. . .
Figure 62, above, represents the XOR gate data. Looking at it, note that
the label, y, is equal to 1 only when exactly one of the values x₁ or x₂ is
equal to 1, not both. This makes it a particularly challenging dataset, as
the data is not linearly separable, i.e. there is no single straight-line
decision boundary that can successfully separate the two classes (y=1
and y=0) in the data. XOR used to be the bane of earlier forms of
artificial neural networks.
Recall that our current neural network was successful only because it
could figure out the straight line decision boundary that could
successfully separate the two classes of the OR gate dataset. A straight
line won’t cut it here. So, how do we get a neural network to figure this
one out?
Feature Engineering
Let’s look at a dataset similar-looking to the XOR data that will help us
make an important realization.
Fig 64. XOR-like data in different quadrants
The data in Figure 64 is exactly like the XOR data, except each data
point is spread out in a different quadrant. Notice that in the 1ˢᵗ and 3ʳᵈ
quadrants all the values are positive, and in the 2ⁿᵈ and 4ᵗʰ all the
values are negative.
Why is that? In the 1ˢᵗ and 3ʳᵈ quadrants the two coordinates share the
same sign (the sign is, in effect, being squared), so their product is
positive, while in the 2ⁿᵈ and 4ᵗʰ quadrants the product is between a
negative and a positive number, resulting in a negative number.
Fig 66. Result of the product of features
Let’s add the new synthetic feature to our training data, Xₜᵣₐᵢₙ.
Fig 68. New training data
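A sketch of building that new training data: the synthetic feature is simply the element-wise product x₁·x₂, stacked as a third row of Xₜᵣₐᵢₙ:

```python
import numpy as np

# XOR gate dataset, one example per column
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]])

X_new = np.vstack([X, X[0] * X[1]])  # third feature: the product x1 * x2
print(X_new)
# [[0 0 1 1]
#  [0 1 0 1]
#  [0 0 0 1]]
```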
Given below is the first training iteration of the neural network; you
may go through the computations and confirm them yourself, as they
make for a good exercise. Since we are already familiar with this
neural network architecture, I will not go through all the computations
in a step-by-step manner, as before.
After 5000 epochs, the learning curve, and the decision boundary look
as follows:
Fig 74. Learning Curve and Decision Boundary of the neural net with a feature cross
Fig 80. Cost of the neural net with one hidden layer after first forward propagation
Whew! That was a lot, but it did a great deal to improve our
understanding. Let’s perform the gradient descent update:
Fig 82. Gradient descent update for the neural net with a hidden layer
After 5000 epochs the Cost steadily decreases to about 0.0009 and we
get the following Learning Curve and Decision Boundary:
Fig 84. Learning Curve and Decision boundary of the neural net with one hidden layer
Let’s also visualize where the decision of the neural network changes
from 0 (red) to 1 (green):
Fig 85. Shaded decision boundary of the neural net with one hidden layer
This shows that the neural network has in fact learned where to fire up
(output 1) and where to lie dormant (output 0).
So, our last example was a “2-layer neural network” (one hidden layer
plus the output layer), and all the examples before it were just a “1-layer
neural network” (the output layer only).
3- Why use Activation Functions?
Activation functions are nonlinear functions that add nonlinearity
to the neurons. The feature crosses are a result of stacking the
activation functions in hidden layers; the combination of a bunch of
activation functions thus results in a complex nonlinear decision
boundary. In this blog, we used the sigmoid/logistic activation
function, but there are many other types of activation functions (ReLU
being a popular choice for hidden layers), each providing a certain
benefit. The choice of activation function is also a hyperparameter
when creating neural networks.
Fig 86. Showing that stacking linear layers/functions results in a linear layer/function
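Fig 86 shows this algebraically; we can also confirm it numerically: two stacked linear layers with no activation in between collapse into a single equivalent linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=(3, 1))  # first linear layer
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=(1, 1))  # second linear layer

x = rng.normal(size=(2, 1))
stacked = W2 @ (W1 @ x + b1) + b2    # two linear layers, no nonlinearity between

W_eq, b_eq = W2 @ W1, W2 @ b1 + b2   # the single layer they collapse into
single = W_eq @ x + b_eq
print(np.allclose(stacked, single))  # True
```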
There are many ways to set weights randomly in neural networks. For
small neural networks, it is OK to set the weights to small random
values. For larger networks, we tend to use the “Xavier” or “He”
initialization methods (these will appear in the coding section). Both
methods still set weights to random values but control their variance.
For now, suffice it to say: use these methods when the network does not
seem to converge and the Cost becomes static or decreases very slowly
under the “plain” method of setting weights to small random values.
Weight initialization is an active research area and will be a topic for a
future “Nothing but Numpy” blog.
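For reference, a sketch of all three schemes (the scaling factors shown are the commonly used ones; the coding section’s exact implementation may differ):

```python
import numpy as np

def initialize_weights(n_in, n_out, ini_type="plain"):
    # Returns a random (n_out x n_in) weight matrix with controlled variance
    if ini_type == "plain":
        return np.random.randn(n_out, n_in) * 0.01           # small random values
    if ini_type == "xavier":
        return np.random.randn(n_out, n_in) / np.sqrt(n_in)  # variance ~ 1/n_in
    if ini_type == "he":
        return np.random.randn(n_out, n_in) * np.sqrt(2 / n_in)  # suited to ReLU
    raise ValueError(f"unknown ini_type: {ini_type}")
```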
. . .
Let's first see the Linear Layer class. The constructor takes as
arguments: the shape of the data coming in (input_shape), the
number of neurons the layer outputs (n_out), and the type of
random weight initialization to be performed
(ini_type="plain", the default, which is just small random
gaussian numbers).
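The full class lives in the accompanying notebooks; here is a minimal sketch consistent with the description above (the method names and bodies are my reconstruction, not the repository’s exact code):

```python
import numpy as np

class LinearLayer:
    # Computes Z = W·A_prev + b for a batch of column-wise examples
    def __init__(self, input_shape, n_out, ini_type="plain"):
        n_in = input_shape[0]
        # "plain" init: small random gaussians (Xavier/He omitted for brevity)
        self.W = np.random.randn(n_out, n_in) * 0.01
        self.b = np.zeros((n_out, 1))
        self.Z = np.zeros((n_out, input_shape[1]))

    def forward(self, A_prev):
        self.A_prev = A_prev
        self.Z = np.dot(self.W, self.A_prev) + self.b
        return self.Z

    def backward(self, upstream_grad):
        # local gradients: dZ/dW = A_prev, dZ/db = 1, dZ/dA_prev = W
        self.dW = np.dot(upstream_grad, self.A_prev.T)
        self.db = np.sum(upstream_grad, axis=1, keepdims=True)
        return np.dot(self.W.T, upstream_grad)  # gradient for the layer below

    def update_params(self, learning_rate):
        self.W -= learning_rate * self.dW
        self.b -= learning_rate * self.db
```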
Now let's see the Sigmoid Layer class; its constructor takes as an
argument the shape of the data coming in (input_shape) from the Linear
Layer preceding it.
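And a matching sketch of the Sigmoid Layer (again a reconstruction, not the repository’s exact code):

```python
import numpy as np

class SigmoidLayer:
    # Applies the sigmoid element-wise to the output of the preceding Linear Layer
    def __init__(self, input_shape):
        self.A = np.zeros(input_shape)

    def forward(self, Z):
        self.A = 1 / (1 + np.exp(-Z))
        return self.A

    def backward(self, upstream_grad):
        # chain through the local gradient: sigma'(z) = sigma(z) * (1 - sigma(z))
        return upstream_grad * self.A * (1 - self.A)
```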
Now we are ready to create our neural network. Let’s use the
architecture defined in Figure 77 for XOR data.
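A sketch of how the training loop might wire these classes together on the XOR data (the hidden-layer width of 3 is an assumption here; Fig 77 fixes the actual architecture):

```python
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]])
Y = np.array([[0, 1, 1, 0]])  # XOR labels

Z1 = LinearLayer(input_shape=X.shape, n_out=3)     # hidden linear layer
A1 = SigmoidLayer(Z1.Z.shape)                      # hidden activation
Z2 = LinearLayer(input_shape=A1.A.shape, n_out=1)  # output linear layer
A2 = SigmoidLayer(Z2.Z.shape)                      # output activation

m, learning_rate = X.shape[1], 1.0
for epoch in range(5000):
    # forward propagation through all four layers
    Y_hat = A2.forward(Z2.forward(A1.forward(Z1.forward(X))))
    cost = np.sum(0.5 * (Y_hat - Y) ** 2) / m
    # backward propagation, chaining the layers in reverse order
    Z1.backward(A1.backward(Z2.backward(A2.backward((Y_hat - Y) / m))))
    # gradient descent update
    Z2.update_params(learning_rate)
    Z1.update_params(learning_rate)
    if epoch % 100 == 0:
        print(f"Cost at epoch#{epoch}: {cost}")
```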
Running the loop in the notebook, we see that the Cost decreases to
about 0.0009 after 4900 epochs:
...
Cost at epoch#4600: 0.001018305488651183
Cost at epoch#4700: 0.000983783942124411
Cost at epoch#4800: 0.0009514180100050973
Cost at epoch#4900: 0.0009210166430616655
Fig 93. The Learning Curve, Decision Boundary, and Shaded Decision Boundary.
Make sure to check out the other notebooks in the repository. We’ll be
building upon the things we learned in this blog in future Nothing but
NumPy blogs; therefore, it would behoove you to create the layer
classes from memory as an exercise and try recreating the OR gate
example from Part Ⅰ.
. . .