Beginner's PyTorch Guide
by Dev G on September 03
First, I wanted to say thank you guys for all the positive feedback on the ML Challenge. I’m
glad you enjoyed the explanations and my teaching style.
For those who missed the first round, you can still join for free here.
PyTorch is a Deep Learning library, meaning that it’s used to build and train Neural
Networks.
At the end of this chapter, I’ll link a short video where I go over some practice questions.
In Linear Regression, we predict an output Y based on a few input attributes, like X₁, X₂, and X₃.
Y = W₁ ⋅ X₁ + W₂ ⋅ X₂ + W₃ ⋅ X₃ + b
W₁, W₂, W₃, and b are the parameters of the model, meaning that their values are updated
during training.
Let’s say X₁ represents a teenager’s current weight, X₂ their current height, and X₃ the
average of their parents’ heights.
Y will represent the model’s prediction for how tall this person will be when they finish
growing.
The goal of a Neural Network is to output an accurate prediction based on the input
attributes.
Each of the three nodes in the first column stores an input attribute.
Next, each node in the middle column uses this equation to predict a number Y:
Y = W₁ ⋅ X₁ + W₂ ⋅ X₂ + W₃ ⋅ X₃ + b
But those nodes are EACH predicting an output, so we actually have 4 different Y values, Y₁
through Y₄.
Lastly, the final node. This node takes Y₁ through Y₄ as input, and also uses Linear
Regression to calculate an output Z (also referred to as ŷ in the diagram below).
Z = W₁ ⋅ Y₁ + W₂ ⋅ Y₂ + W₃ ⋅ Y₃ + W₄ ⋅ Y₄ + b
Z is the prediction for how tall someone will be when they finish growing.
And that’s all Neural Networks are! They’re just combinations of nodes that use Linear
Regression.
-----------
Okay, there is one detail left to discuss.
This detail is what makes Neural Networks different from Linear Regression.
Before passing Y₁ through Y₄ into the final node, we first pass each value into another
function, called the Sigmoid, and the outputs of that function will be sent into the final node.
Why do we need to pass Y₁ through Y₄ into the Sigmoid before passing the values into the
final node?
Well . . . without a nonlinear function, the Neural Network can only learn simple, linear
relationships like the one below.
“Learning” is the process in which the parameters of the model (all the W’s and b’s) are
updated.
Without a nonlinear function, the Neural Network’s prediction will remain inaccurate, even
if we let the network learn for thousands of iterations.
Nonlinearities are essential to the success of Neural Networks! Since most data in the
world is nonlinear, we need to add functions like the Sigmoid into the model.
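To make this concrete, here's a minimal sketch of the network above in PyTorch (the parameter values are random placeholders, since the network hasn't been trained):

```python
import torch

# x stores the 3 input attributes: current weight, current height,
# and the average of the parents' heights.
x = torch.tensor([45.0, 1.5, 1.7])

# Middle column: 4 nodes, each with its own W1, W2, W3 and b.
W_mid = torch.randn(4, 3)   # one row of weights per node
b_mid = torch.randn(4)
y = W_mid @ x + b_mid       # Y1 through Y4, one per node

# Pass Y1 through Y4 into the Sigmoid before the final node.
y = torch.sigmoid(y)

# Final node: its own W1 through W4 and b, producing the prediction Z.
W_out = torch.randn(4)
b_out = torch.randn(1)
z = W_out @ y + b_out
print(z)  # an (untrained, random) height prediction
```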
-----------
Hey everyone,
Today’s chapter may seem a bit abstract, since we’re going over the basic data types and
functions of PyTorch.
But I promise that learning these fundamentals will pay off tomorrow, when we go over
Neural Networks in PyTorch!
Tensors (essentially arrays that can have any number of dimensions) are useful for storing
the inputs and outputs of Machine Learning models. Here are two simple examples.
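For instance (the values are illustrative):

```python
import torch

# A 1-D tensor storing one data point's input attributes.
x = torch.tensor([45.0, 1.5, 1.7])

# A 2-D tensor storing a small batch of 2 data points, 3 attributes each.
batch = torch.tensor([[45.0, 1.5, 1.7],
                      [60.0, 1.7, 1.8]])

print(x.shape)      # torch.Size([3])
print(batch.shape)  # torch.Size([2, 3])
```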
PyTorch is a Python library, but many of the functions are internally written in C++.
These functions take advantage of parallel processing whenever possible, and are
optimized for speed.
So, whenever we’re dealing with Machine Learning data, tensors are the go-to.
-----------
The following functions are used all the time when training models!
First, the reshape() function. We simply have to specify the input tensor and the new size.
The reshape() function is used all the time in CNNs for reshaping images!
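A quick example:

```python
import torch

t = torch.arange(6)           # tensor([0, 1, 2, 3, 4, 5])
m = torch.reshape(t, (2, 3))  # same 6 values, now 2 rows and 3 columns
print(m)
# tensor([[0, 1, 2],
#         [3, 4, 5]])
```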
Next, the sum() function. We have to carefully specify the dim parameter, depending on
whether we want to sum each row or column!
The sum() function is often used when we want to combine the predictions from two
models.
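Here's how dim changes the result:

```python
import torch

m = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])

print(torch.sum(m, dim=0))  # sums down each column: tensor([5., 7., 9.])
print(torch.sum(m, dim=1))  # sums across each row:  tensor([ 6., 15.])
```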
The cat() function is commonly used to stack tensors on top of each other or side by side!
One cool application of cat() is in Multimodal LLMs.
We want the model to process images and text, so at some point, inside the model, we have
to concatenate the image and text tensors!
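A quick example of both directions:

```python
import torch

a = torch.ones(2, 3)
b = torch.zeros(2, 3)

print(torch.cat([a, b], dim=0).shape)  # stacked on top: torch.Size([4, 3])
print(torch.cat([a, b], dim=1).shape)  # side by side:   torch.Size([2, 6])
```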
Lastly, the mse_loss() function, which stands for Mean Squared Error Loss.
When calculating the error for a regression model such as:
Y = W₁ ⋅ X₁ + W₂ ⋅ X₂ + W₃ ⋅ X₃ + b
We use the Mean Squared Error function to calculate the error between the model’s
prediction Y and the “true” answer from the dataset.
This function first calculates the difference between the model’s prediction and the “true”
answer for every data point in our dataset.
Next, it squares each difference, which removes the sign (+ or -) from positive and negative
errors.
Finally, the function averages all those squared values, returning one final number.
Using mse_loss() is much faster than manually looping over all data points and calculating
the error, since this function takes advantage of parallel processing when possible.
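For example (the values are illustrative):

```python
import torch
import torch.nn.functional as F

predictions  = torch.tensor([1.8, 1.6, 1.9])
true_answers = torch.tensor([1.7, 1.7, 1.9])

loss = F.mse_loss(predictions, true_answers)
print(loss)  # mean of (0.1², 0.1², 0.0²), roughly tensor(0.0067)
```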
Stay tuned for Chapters 3 and 4 of our challenge to see the function used in action!
-----------
P.S. If you want to jump ahead and learn more PyTorch, I created a 7-minute video covering
the basics of Neural Networks and how to build them in PyTorch.
Coding Neural Networks In PyTorch
by Dev G on September 03
Hey everyone,
If you need a refresher on Neural Networks, here is a quick review of the concepts from
Chapter 1.
If you need a refresher on basic PyTorch syntax, check out this intro to PyTorch video.
-----------
In PyTorch, we define a Neural Network by writing a class. After the class is written, we can
make an instance of the class and use the model to make predictions.
Also, let’s say this model predicts how tall someone will be, once they’re finished growing.
The network takes in X₁ (a teenager’s current weight), X₂ (their current height), and X₃ (the
average of their parents’ heights).
Our class will contain two functions. The first is the __init__() function, also called the
constructor. This is where we’ll define the number of nodes in each column of the Neural
Network diagram!
Next is the forward() function.
This is where we’ll pass in X₁ (a teenager’s current weight), X₂ (their current height), and X₃
(the average of their parents’ heights) and return the model’s prediction!
BTW, the input data point (X₁, X₂, and X₃) will be stored in an array x, which will have size 3.
It’s time to write the __init__() function and define the model.
We will make use of the built-in PyTorch class nn.Linear! This class has its own forward()
function already defined, which we will make use of later.
In the above Neural Network diagram, we know that each node in the middle column uses
this equation:
Y = W₁ ⋅ X₁ + W₂ ⋅ X₂ + W₃ ⋅ X₃ + b
The model has to store W₁ through W₃ plus the constant b for each node. This is exactly
what nn.Linear will keep track of!
The first input to nn.Linear() is the number of nodes in the previous layer, and the
second input is the number of nodes in the current layer.
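Here's a sketch of that __init__() (the class name HeightPredictor is illustrative; the layer names match the ones used below):

```python
import torch.nn as nn

class HeightPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        # 3 input attributes -> 4 nodes in the middle column
        self.middle_layer = nn.Linear(3, 4)
        # 4 middle nodes -> 1 node in the final column
        self.final_layer = nn.Linear(4, 1)
```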
Let’s move on to the forward() function, which calculates the final output.
We call this the “forward” function, since we can imagine the data flowing from left to right
through the network.
The simplest way to write this function is to pass x into the middle_layer, and then pass
the output into the final_layer.
We can directly call the forward() method of each layer like this:
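A sketch of that version (continuing the class above; the hidden variable name is illustrative):

```python
    def forward(self, x):
        # explicitly call each layer's forward() method
        hidden = self.middle_layer.forward(x)
        output = self.final_layer.forward(hidden)
        return output
```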
Or, we can use this syntax, since Python knows that we want to call the forward() method.
This option is preferred since it’s more concise:
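A sketch of the concise version:

```python
    def forward(self, x):
        # calling a layer object directly runs its forward() method for us
        hidden = self.middle_layer(x)
        return self.final_layer(hidden)
```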
-----------
Tomorrow, we’ll make an instance of the class we wrote, and we’ll train it on a dataset!
The code for training the model will bridge together many concepts, so you won’t want to
miss it.
-----------
If you want to review this material further, check out this video I created!
I also included timecodes in case you want to skip the Neural Networks Review & go
straight to PyTorch.
-----------
When using the PyTorch library, every model must inherit from (or in other words,
subclass) the parent class nn.Module.
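For example (the class name is illustrative):

```python
import torch.nn as nn

class HeightPredictor(nn.Module):  # the parent class nn.Module goes in parentheses
    ...
```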
The Python syntax for inheritance is shown above. Simply write the parent class in
parentheses.
If this part is unfamiliar to you, don’t worry! It’s actually not as important as the rest of this
chapter.
PyTorch Finale: Training Neural Networks
by Dev G on September 03
Hey everyone,
If you need a refresher on Neural Networks, here is a quick review of the concepts from
Chapter 1.
If you need a refresher on basic PyTorch syntax, check out this intro to PyTorch video.
We’re finally going to write the simple for loop that is used to train all modern ML models.
-----------
We’ll pick up right where we finished yesterday. Here is the Neural Network we
implemented:
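Here's a sketch matching that description (the HeightPredictor name is illustrative):

```python
import torch.nn as nn

class HeightPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.middle_layer = nn.Linear(3, 4)  # 3 inputs -> 4 middle nodes
        self.final_layer = nn.Linear(4, 1)   # 4 middle nodes -> 1 output

    def forward(self, x):
        hidden = self.middle_layer(x)
        return self.final_layer(hidden)
```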
This model takes in 3 input attributes (such as a person’s current weight, height, and the
average of their parents’ heights).
And it predicts a single number, such as how tall the person will eventually grow.
Take the time to understand this class! Today’s concepts build on top of yesterday’s.
If you want to review these required fundamentals, I highly recommend the Intro to
PyTorch video.
Okay, the first step in training the model is to make an instance of our class:
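Using the illustrative class from above:

```python
model = HeightPredictor()  # the parameters start out with random values
```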
We can now use the model to get predictions!
Let’s pass in a simple data point, where the person’s current weight is 45 (kg), current
height is 1.5 (meters), and the average of parents’ heights is 1.7 (meters).
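A sketch of that call:

```python
import torch

x = torch.tensor([45.0, 1.5, 1.7])  # weight (kg), height (m), parents' average height (m)
prediction = model(x)
print(prediction)  # something nonsensical like tensor([13.7])
```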
The model’s prediction for the final height is 13.7 meters! This makes no sense, but a poor
prediction is expected.
We haven’t trained the model yet, so the values for W₁, W₂, W₃, b, etc. are completely
random!
Gradient Descent is the most important algorithm in Machine Learning. It uses derivatives
to update the parameters and minimize the model’s error.
Click here for the most concise explanation of Gradient Descent you’ve ever seen.
I’ve created countless explanations of Gradient Descent, and with each iteration, I’ve made
the explanation more and more concise.
My most recent iteration is only three minutes long, and I can’t recommend the video
enough.
First, let’s get the model’s predictions using the current parameter values:
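A sketch of that step (dataset is assumed to be defined as described below; the .squeeze(1) is one way to flatten the model's N-by-1 output into a size-N array):

```python
predictions = model(dataset).squeeze(1)  # one prediction per data point
```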
We are assuming that the dataset has been previously defined!
Another day, we’ll go over how to define the dataset. For now, let’s assume that dataset is an
array of size N by 3, where N is the number of data points we have.
As a result, predictions will be an array of size N, where each entry is the model’s
prediction for a data point.
Next, let’s calculate the model’s error using the Mean Squared Error function, which we
went over on Day 2!
TLDR for Mean Squared Error: Calculate the difference between the model’s prediction and
the true answer for each data point, square all the differences, and then average them all
together to return one final number.
For now, let’s assume that true_answers is an array of size N. Each entry stores the final
height for a person in the dataset.
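A sketch of that step:

```python
import torch.nn.functional as F

loss = F.mse_loss(predictions, true_answers)  # one number: the average squared error
```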
Next, let’s calculate all necessary derivatives for Gradient Descent with a simple call to
backward().
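In code, this is a single line:

```python
loss.backward()  # computes the derivative of the error with respect to every parameter
```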
You may be wondering what in the world backward() is doing.
By this point, you should know that derivatives are necessary to use the Gradient Descent
formula and update the parameters.
But how do we actually calculate the derivatives required to update each parameter?
Backward() is like the reverse of forward(). Using the model’s prediction and the
corresponding error, calculations now flow from right to left through the network, in order
to calculate all the required derivatives.
We have to go from right to left for backward(). Why?
If you do the math by hand (don’t worry, this is not necessary), this is what you would find:
The derivatives that are needed to update the parameters in the hidden layer of the
network depend on the derivatives for the parameters in the output layer of the network.
So, we have to calculate the final layer’s derivatives first, and then use those to calculate the
middle layer’s derivatives.
This is why we call the function backward(). We’re going from right to left.
If that seemed a bit unclear, no worries. It’s the trickiest concept in Machine Learning,
and thankfully PyTorch will calculate all the derivatives for us.
However, it is essential to understand Gradient Descent, and the overall idea of using
derivatives to update the parameters and minimize the error function.
Let’s use the derivatives to update the parameters using this equation, applied to every W and b:
W ← W − learning_rate ⋅ ∂Error/∂W
Thankfully, we can do this in PyTorch with two lines of code.
We simply need to pass in a list of the model parameters that the optimizer is responsible
for updating.
We’re using the SGD class, which stands for Stochastic Gradient Descent, a form of
Gradient Descent.
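A sketch of the first line (the learning rate value 0.01 is illustrative):

```python
import torch

# model.parameters() hands the optimizer every W and b in the model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```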
Next, call the step() function which will use the Gradient Descent equation to update each
parameter.
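And the second line:

```python
optimizer.step()  # applies the Gradient Descent update to every parameter
```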
After repeating these steps in a loop for many iterations, the model is trained!
If we try to get the model prediction for a sample data point, we will now get a much more
accurate prediction of 1.8 meters.
Before we wrap up, two final points. First, the code for the training loop is nearly identical
whether we’re training a height-prediction model, an LLM to generate text, or a vision
model to classify images!
We get the model predictions, calculate derivatives, and use Gradient Descent to improve
the model predictions!
The only thing that varies is the Neural Network diagram and thus the class we define.
Next, there is actually one final line of code that we need for the training loop. I saved it for
the end of the chapter so that the explanation from earlier was as intuitive as possible.
Without this simple line of code, the model will fail to train properly. Here it is:
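A sketch, using the optimizer from above:

```python
optimizer.zero_grad()  # reset all stored derivatives to zero
```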
By default, PyTorch will store the derivatives from previous iterations and maintain a
running sum.
This is not what we want. Instead, we want to “reset” or “zero” the derivatives after every
iteration, so that the next iteration’s derivatives are not affected.
It’s a bit odd that PyTorch adds up all the derivatives by default, since this is not the typical
use case.
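Putting it all together, here's a sketch of the full training loop (the class, dataset, and true_answers are the illustrative names from above, and the iteration count and learning rate are placeholders):

```python
import torch
import torch.nn.functional as F

model = HeightPredictor()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for iteration in range(1000):
    predictions = model(dataset).squeeze(1)       # get predictions
    loss = F.mse_loss(predictions, true_answers)  # measure the error
    optimizer.zero_grad()                         # clear last iteration's derivatives
    loss.backward()                               # compute fresh derivatives
    optimizer.step()                              # Gradient Descent update
```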