CME 213, Introduction to parallel computing

Eric Darve
Spring 2025

Homework 5

Total number of points: 100.

Problem 1 Neural Network on the GPU


Similar to how you implemented a class DeviceMatrix to represent matrices on the GPU in Homework 2,
you will implement a class to represent neural networks on the GPU. The purpose of these abstractions is
to make it easier to write and debug code. All information on the neural network architecture can be found
in section C.
In this homework, you will implement a class DeviceNeuralNetwork.

To simplify backward calculations, assume that the neural network has two layers.
Question 1.1
(5 points) Implement the constructor for the class DeviceNeuralNetwork in neural_network.cpp.

Deliverables:

1. Code: The completed class DeviceNeuralNetwork in neural_network.cpp.

Question 1.2
(5 points) Implement the member function DeviceNeuralNetwork::to_cpu.

Deliverables:

1. Code: The completed function DeviceNeuralNetwork::to_cpu in neural_network.cpp.

Question 1.3
(5 points) Implement the constructor for the class GPUGrads.

Deliverables:

1. Code: The completed constructor GPUGrads::GPUGrads in neural_network.cpp.

Problem 2 Forward Pass


Question 2.1
(30 points) Implement the member function DeviceNeuralNetwork::forward in neural_network.cpp. We
will test your implementation by comparing the cached outputs of each layer with the cached outputs of
the CPU implementation.

Deliverables:

1. Code: The completed function DeviceNeuralNetwork::forward in neural_network.cpp.

Problem 3 Loss
Question 3.1
(10 points) Implement the member function DeviceNeuralNetwork::loss in neural_network.cpp. We
will test your implementation by comparing its output with the output of the CPU implementation.

Deliverables:

1. Code: The completed function DeviceNeuralNetwork::loss in neural_network.cpp.

Problem 4 Backward Pass


Question 4.1
(30 points) Implement the member function DeviceNeuralNetwork::backward in neural_network.cpp.
You can use the wrapper function TiledGEMM in gpu_func.cu to perform the GEMM operation

C ← αAB + βC

when either A or B is transposed. You are welcome to use or modify your own GEMM implementation
from Homework 4. Note that only neural_network.cpp will be submitted for grading. We will test your
implementation by comparing the cached gradients with respect to each parameter with the cached gradients
of CPU implementation.

Deliverables:

1. Code: The completed function DeviceNeuralNetwork::backward in neural_network.cpp.

Problem 5 Optimizer Step


Question 5.1
(5 points) Implement the member function DeviceNeuralNetwork::step in neural_network.cpp. We will
test your implementation by comparing the updated parameters with the updated parameters of the CPU
implementation.

Deliverables:

1. Code: The completed function DeviceNeuralNetwork::step in neural_network.cpp.

Problem 6 Profiling with Nsight


Question 6.1
(10 points) Use Nsight to profile a training loop for the neural network that we have provided. After studying the training loop in main_q6.cpp (it is already implemented), profile main_q6 using the commands provided in sbatch [Link]. For reference, the command is:

nsys profile --trace=cuda,nvtx,osrt --output=mainq6_profile --force-overwrite=true main_q6

This will produce a file, mainq6_profile.nsys-rep, which you will need to download and view in Nsight Systems on your local machine.

Study the profile output and identify parts of the training loop that can be changed to improve performance. Look for improvements both in the training loop code that we have provided and in the code you wrote in Problems 1-5 that is used in the loop. Include screenshots of relevant parts of the profiling output in your answer.

Deliverables:

1. Writeup: Comments and analysis of your Nsight output with screenshots to justify.

References
[1] Yann LeCun et al. MNIST. [Link] [Online].

A Running the Code


1. We have provided a script [Link] that compiles the code and runs all tests. You can run it on the
cluster using sbatch [Link].

2. You can compile the code using make, or using

DOUBLE_FLAG=-DUSE_DOUBLE make

to create executables that use double-precision floating-point numbers.

3. You can add additional tests with different size configurations to make sure your code runs correctly,
but we only require neural_network.cpp for submission.

B Submission instructions
1. For all questions that require explanations and answers besides source code, put those explanations
and answers in a separate single PDF file. Upload this file on Gradescope.

2. Submit your code by uploading a zip file on Gradescope. Here is the list of files we are expecting:

neural_network.cpp

We will not evaluate any code in files not listed above. Make sure to keep all file names as they are.

C Neural Networks on CUDA


Neural networks are widely used in machine learning problems, specifically in the domains of image
processing, computer vision, and natural language processing. There is a flurry of research projects on
deep learning, which uses more advanced variants of the simpler neural network we cover here. Therefore,
being able to train neural networks efficiently is important and is the goal of this project.

Figure 1: Examples of MNIST digits.

Data: MNIST
We will be using the MNIST [1] dataset, which consists of 28 × 28 greyscale images of handwritten digits
from 0 to 9. Some examples from this dataset are shown in Figure 1.
The dataset is divided into a training set of 60,000 images and a test set of 10,000 images. We will use the
training set to optimize the parameters of our neural network and we will use the unseen test set to measure
the performance of the trained network. We denote the ith example in the training set by (x^{(i)}, y^{(i)}), where x^{(i)} denotes the image and y^{(i)} denotes the corresponding class label (i.e., the digit shown in the image x^{(i)}).

Model: Neural Networks


Neurons
To describe neural networks we begin by describing the simplest neural network, which comprises a single
neuron.

Figure 2: A single neuron.

The neuron illustrated in Figure 2 is a computational unit that takes as input x = (x_1, x_2, x_3) and outputs

h_{W,b}(x) = f(Wx + b) = f( Σ_{i=1}^{3} W_i x_i + b ),
where f : R → R is some non-linear activation function, W is the weight of the neuron, and b is the bias of
the neuron. The row vector W and the scalar b are referred to as the parameters of the neuron, and the
output of the neuron is referred to as its activation.
In this project, we let f be the sigmoid function given by
f(z) = σ(z) = 1 / (1 + exp(−z)).
The derivative of the sigmoid function with respect to its input is

∂σ(x)/∂x = −(1 / (1 + exp(−x))²) · ∂exp(−x)/∂x = exp(−x) / (1 + exp(−x))² = σ(x)(1 − σ(x));

we will use this fact repeatedly in the following sections. Other common activation functions include
f (z) = tanh(z) and the rectified linear unit (ReLU) f (z) = max(0, z). These are illustrated in Figure 3.


Figure 3: Examples of three activation functions: tanh(x), 1/(1 + exp(−x)) (sigmoid), and the rectified
linear unit (ReLU).

A single neuron can be trained to perform the task of binary classification. Consider the example of cancer detection, where the task is to classify a tumor as benign or malignant. We can provide as input x = (size of tumor, location of tumor, length of time for which the tumor has existed), and if the label is

y = 1 for a malignant tumor and y = 0 for a benign tumor,

we can say that the neuron predicts that the tumor is malignant if and only if f(Wx + b) > 0.5.
Since the value of f (W x + b) depends on the sign of W x + b, the neuron effectively partitions the input
space R3 using a 2-dimensional hyperplane. On one side of the hyperplane we have f (W x + b) > 0.5, and
on the other side of the hyperplane we have f (W x + b) < 0.5. Through an optimization process referred
to as training, we want to find values of the parameters W and b such that the hyperplane represented by
the neuron is as close as possible to the ‘true’ hyperplane.
More generally, we want to find values of the parameters W and b such that the network’s predictions
are ‘good’ on an unseen test set, since this would imply that our choice of model (here, a neuron with
certain values of W and b) is close to the ‘true’ model corresponding to reality.
It is insufficient to observe good predictions on the training set. Sufficiently complex networks can be
trained to make perfect predictions on the training set but they perform much worse on unseen data that
they were not trained on, implying that the trained model is not close to the ‘true’ model.
In this project, we would like to train a neural network to perform multi-class classification rather than
binary classification. Instead of simply predicting true or false, we would like the network we train to be
able to accurately predict which of 10 different digits is shown in the input image.


Figure 4: Fully connected feedforward neural network with two layers.

Fully connected feedforward neural network


Figure 4 shows a fully connected feedforward neural network with an input layer, one hidden layer, and an
output layer. Such a network is referred to as a ‘two-layer fully connected feedforward neural network’ or
a ‘two-layer multilayer perceptron (MLP)’.
The input layer is not counted since a neuron in the input layer performs no computation. For example,
the first neuron in the input layer takes as input x1 and outputs x1 . As this output travels along the edge
connecting the first neuron in the input layer to the first neuron in the hidden layer, it is multiplied by the
weight W1 of the first neuron in the hidden layer. Once it reaches the first neuron in the hidden layer, it
is added to the bias b of the first neuron in the hidden layer, and the result is passed through the sigmoid
function to obtain the activation of the first neuron in the hidden layer.
This process must be repeated for each element of the input vector x ∈ R^{d×1} and for each of the H_1 neurons in layer 1. An efficient way to do this is to use matrix multiplication and compute a^{(1)} = f(W^{(1)} x + b^{(1)}), where W^{(1)} ∈ R^{H_1 × d}, b^{(1)} ∈ R^{H_1 × 1}, and f is the sigmoid function. The element W^{(1)}_{ij} is the jth weight of the ith neuron in layer 1, the element b^{(1)}_i is the bias of the ith neuron in layer 1, and (W^{(1)}, b^{(1)}) are referred to as the parameters of layer 1. The vector a^{(1)} ∈ R^{H_1 × 1} is referred to as the activation of layer 1 and consists of the activations of the neurons in layer 1.

The activation function of the output layer is special. Instead of each neuron independently applying the sigmoid function to its input, all neurons in the output layer collectively compute softmax(W^{(2)} a^{(1)} + b^{(2)}). If C denotes the number of possible class labels, we have C = 10 since there are 10 possible digits 0, . . . , 9. For 1 ≤ i ≤ 10, using the softmax activation function allows us to interpret the ith element of the output vector ŷ ∈ R^{C×1} as the neural network's prediction of the probability that the digit in the input image is digit i − 1.
In general, if H_i is the number of neurons in layer i, then the parameters of layer i are W^{(i)} ∈ R^{H_i × H_{i−1}} and b^{(i)} ∈ R^{H_i × 1}. In Figure 4 we have d = H_0 = 4, H_1 = 5, H_2 = 3, so W^{(1)} ∈ R^{5×4}, b^{(1)} ∈ R^{5×1}, W^{(2)} ∈ R^{3×5}, and b^{(2)} ∈ R^{3×1}. To efficiently process a batch of N inputs x_1, . . . , x_N ∈ R^{d×1}, we can stack them horizontally to obtain a matrix X = [x_1 · · · x_N] ∈ R^{d×N}, and compute a batch of activations

A^{(1)} = f(W^{(1)} X + B^{(1)}) ∈ R^{H_1 × N}, where B^{(1)} = [b^{(1)} · · · b^{(1)}] ∈ R^{H_1 × N}.

Forward pass
The forward pass is the process of computing the activations of all neurons in the network for an input x (or a batch of inputs). For a two-layer MLP, we compute

z^{(1)} = W^{(1)} x + b^{(1)}
a^{(1)} = σ(z^{(1)})
z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}
ŷ = a^{(2)} = softmax(z^{(2)})

The softmax function is defined by:


(2)
(2) def def exp(zj )
softmax(z )j = P (label = j|x) = C
P (2)
exp(zi )
i=1

This equation says that the probability that the input has label j (i.e., in our case, that the digit j is handwritten in the input image) is given by softmax(z^{(2)})_j. Therefore, our predicted label for the input x is given by

label = argmax_j(ŷ_j)

This is the digit the network believes is written in the input image.

Loss
Recall that our objective is to learn the parameters of the neural network such that it achieves the best accuracy on the test set. Let y be the one-hot vector denoting the class of the input: y_c = 1 if c is the correct label and y_i = 0 for all i ≠ c. We want P(label = c | x) to be as high as possible (i.e., close to 1).
Without going into the mathematical details, we will use the following general expression to determine
the error of our neural network. This expression turns out to be the most convenient for our purpose:
CE(y, ŷ) = − Σ_{i=1}^{C} y_i log(ŷ_i)

CE stands for cross-entropy. Since y is a one-hot vector, this simplifies to

CE(y, ŷ) = − log(ŷ_c)

We can observe that CE is 0 when we have the optimal answer ŷ_c = 1. Similarly, CE is maximal (+∞) when ŷ_c is 0. This corresponds to a neural network that is “sure” that the digit is not c (maximally wrong).
The total cost for N input data points (where the cross-entropy of the ith training vector is denoted CE^{(i)}) is:

cost = J(W, b; x, y) = (1/N) Σ_{i=1}^{N} CE^{(i)}(y, ŷ)

The above cost measures the error, i.e. our “dissatisfaction”, with the output of the network. The more certain the network is about the correct label (high P(label = c | x)), the lower our cost will be.
Clearly, we should choose the parameters that minimize this cost. This is an optimization problem, and
may be solved using the method of Stochastic Gradient Descent (described below).

Our neural network applies a non-linear function to the input because of the sigmoid and softmax functions. When optimizing the neural network, we often add a penalization term for the magnitude of W in order to control the non-linearity of the network. If we make W smaller, the network becomes ‘more linear’, since σ is approximately linear near 0 and so σ(Wx) is close to an affine function of Wx when Wx ≈ 0. Although W can be made too small, and there is no rigorous justification for this penalization, it is found to work well in practice. With the penalization term, the cost function becomes

J(W, b; x, y) = (1/N) Σ_{i=1}^{N} CE^{(i)}(y, ŷ) + (λ/2) ∥W∥₂²    (1)

where ∥W∥₂² is the sum of the squared l2-norms of all the weight matrices W of the network, and λ is a hyperparameter that needs to be tuned for best performance. In our implementation, only the weights W are penalized, not the biases b.

Backward Pass
The backward pass is the process of using the chain rule to compute ∇p J, the gradient of the loss function
with respect to each parameter of the neural network. This process is also referred to as backpropagation,
since gradients are ‘propagated backward’ through the network using the chain rule.
Let’s compute the gradient for the parameters in the last layer (layer 2) of our network:

∂CE(y, ŷ)/∂z^{(2)}_k = −∂/∂z^{(2)}_k log( exp(z^{(2)}_c) / Σ_{i=1}^{C} exp(z^{(2)}_i) ) = −∂/∂z^{(2)}_k ( z^{(2)}_c − log Σ_{i=1}^{C} exp(z^{(2)}_i) )

There are two cases here:


1. Case I: k = c, i.e., k is the correct label:

∂CE(y, ŷ)/∂z^{(2)}_k = −1 + exp(z^{(2)}_k) / Σ_{i=1}^{C} exp(z^{(2)}_i) = −1 + ŷ_k = ŷ_k − y_k

2. Case II: k ≠ c:

∂CE(y, ŷ)/∂z^{(2)}_k = 0 + exp(z^{(2)}_k) / Σ_{i=1}^{C} exp(z^{(2)}_i) = ŷ_k − y_k

Therefore, the gradient in vector notation simplifies to

∂CE(y, ŷ)/∂z^{(2)} = ŷ − y    (2)
Recall that z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}, with z^{(2)} ∈ R^{H_2×1}, a^{(1)} ∈ R^{H_1×1}, and W^{(2)} ∈ R^{H_2×H_1}. Therefore,

∂CE(y, ŷ)/∂W^{(2)} = (∂CE(y, ŷ)/∂z^{(2)}) (∂z^{(2)}/∂W^{(2)}) = (ŷ − y) [a^{(1)}]^T    (3)

Similarly,

∂CE(y, ŷ)/∂b^{(2)} = ŷ − y    (4)
Going across L2:

∂z^{(2)}/∂a^{(1)} = [W^{(2)}]^T

∂CE(y, ŷ)/∂a^{(1)} = (∂CE(y, ŷ)/∂z^{(2)}) (∂z^{(2)}/∂a^{(1)}) = [W^{(2)}]^T (ŷ − y)
Going across the non-linearity of L1:

∂CE(y, ŷ)/∂z^{(1)} = (∂CE(y, ŷ)/∂a^{(1)}) (∂σ(z^{(1)})/∂z^{(1)}) = (∂CE(y, ŷ)/∂a^{(1)}) ∘ σ(z^{(1)}) ∘ (1 − σ(z^{(1)}))
Note that we have assumed that σ(·) works on vectors (matrices) by applying an element-wise sigmoid,
and ◦ is the element-wise (Hadamard) product.
That brings us to our final gradients:

∂CE(y, ŷ)/∂W^{(1)} = (∂CE(y, ŷ)/∂z^{(1)}) (∂z^{(1)}/∂W^{(1)}) = (∂CE(y, ŷ)/∂z^{(1)}) x^T    (5)

Similarly,

∂CE(y, ŷ)/∂b^{(1)} = ∂CE(y, ŷ)/∂z^{(1)}    (6)
The above equations have been derived for a single training vector, but they extend seamlessly to a matrix of N column vectors. In that case, the weight and bias gradients are summed over all input images x in the batch.

Gradient Descent
Gradient Descent is an iterative algorithm for finding local minima of a function parameterized by some
parameters p. The gradient descent update rule is

p ← p − α ∇p J (7)

where α is the learning rate that controls how large the descent step is. ∇p J is the gradient of J with
respect to the network parameters p.
In practice, we often do not compute J = Σ_{i=1}^{N} CE^{(i)}, since this requires computing CE^{(i)} for all i = 1, . . . , N. Instead, we divide the input into ‘mini-batches’ containing M images and process one mini-batch at a time until all images are processed. For each mini-batch we calculate

J_mb = Σ_{i=k}^{k+M−1} CE^{(i)}

(where x^{(k)} is the first image in the mini-batch), and update the network parameters p according to the update rule

p ← p − α ∇_p J_mb.    (8)

This algorithm is also referred to as Mini-batch Gradient Descent. See below for the pseudo-code, where an ‘epoch’ refers to a single iteration over all N images and corresponds to ⌈N/M⌉ updates to the parameters p. This approach usually leads to faster convergence than Batch Gradient Descent (or simply Gradient Descent) since we update the network coefficients more than once per epoch.

Algorithm 1 Mini-batch Gradient Descent


epoch ← 0
while epoch < MAX_EPOCHS do
batches ← split(training_samples, M )
for batch in batches do
p ← p − step × gradient(batch)
end for
epoch ← epoch + 1
end while
