
CME 213, Introduction to parallel computing

Eric Darve
Spring 2025

Homework 2

Total number of points: 100.


In this programming assignment, you will use NVIDIA’s Compute Unified Device Architecture (CUDA)
language to implement a basic recurrence algorithm and element-wise kernels for a deep neural network. In
the process, you will learn how to write general-purpose GPU programming applications and consider some
optimization techniques. You must turn in your own copy of the assignment as described below. Although
you can collaborate with your classmates on the assignment, sharing solutions is strictly prohibited. If you
have any questions about the assignment, please post them on the forum.
You can access an NVIDIA GPU, which is needed to complete this assignment, from the login node
cme213-login.stanford.edu. You can connect to this node using the command
ssh [email protected]. Each time you log in, you must run the command
ml course/cme213/nvhpc/24.1; alternatively, you can add this command to the file ~/.bashrc to automatically
load the required modules every time you log in. Run make to compile your code, run make clean to
clear the directory of existing object files and executables, and run sbatch hw2.sh to submit your job to
the queue. The output will be in slurm.sh.out.
For all questions asking to comment on plots, make sure to describe the shape and different
regions (such as increasing performance or asymptotic behavior) of the graph and explain why
these patterns may emerge.

CUDA
“C for CUDA” is a subset and extension of the C programming language, and is commonly referred to as
simply CUDA. Many languages support wrappers for CUDA, but in this class, we will develop in C for
CUDA and compile with nvcc.
The programmer creates a general-purpose kernel to be run on a GPU, analogous to a function or
method on a CPU. The compiler allows you to run C++ code on the CPU and the CUDA code on the device
(GPU). Functions that run on the host are prefaced with __host__ in the function declaration. Kernels,
which run on the device and are launched from host code, are prefaced with __global__. Functions that
run on the device and that can only be called from the device are prefaced with __device__.
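For illustration, the three qualifiers look like this in code (the function names here are hypothetical):

    __host__ void run_on_cpu();                        // runs on the host, called from host code
    __global__ void my_kernel(float *x);               // runs on the device, launched from the host
    __device__ float square(float v) { return v * v; } // runs on the device, callable only from device code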
The first step you should take in any CUDA program is to move the data from the host memory to device
memory. The function calls cudaMalloc and cudaMemcpy allocate and copy data, respectively. cudaMalloc
will allocate a specified number of bytes in the device main memory and return a pointer to the memory
block, similar to malloc in C. You should not try to dereference a pointer allocated with cudaMalloc from a
host function.
The second step is to use cudaMemcpy from the CUDA API to transfer a block of memory from the host
to the device. You can also use this function to copy memory from the device to the host. It takes four
parameters: a destination pointer, a source pointer, a size in bytes, and the direction to move the data
(cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost). We have already provided the code to copy
the string from the host memory to the device memory space, and to copy it back after calling your shift
kernel.
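As a minimal sketch of this allocate/copy/copy-back pattern (the buffer names and sizes here are made up):

    void copy_round_trip() {
        const int n = 1024;
        const size_t bytes = n * sizeof(float);
        float *h_data = (float *)malloc(bytes); // host buffer
        float *d_data;                          // device pointer

        cudaMalloc(&d_data, bytes);                                // allocate on the device
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice); // host -> device
        // ... launch kernels that read/write d_data ...
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost); // device -> host
        cudaFree(d_data);
        free(h_data);
    }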
Kernels are launched in CUDA using the syntax kernelName<<<...>>>(...). The arguments inside the
chevrons (<<<blocks, threads>>>) specify the number of thread blocks and the number of threads per
block to be launched for the kernel. The arguments to the kernel are passed by value, as in normal C/C++
functions.

There are some read-only variables that all threads running on the device possess. The three most
valuable to you for this assignment are blockIdx, blockDim, and threadIdx. Each of these variables contains
fields x, y, and z. blockIdx contains the x, y, and z coordinates of the thread block where this thread is
located. blockDim contains the dimensions of the thread block where the thread resides. threadIdx contains
the indices of this thread within its thread block.
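Putting the launch syntax and these variables together, a common pattern for a 1D problem looks like the following sketch (the kernel is hypothetical):

    __global__ void add_one(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
        if (i < n)                                     // guard: the grid may contain extra threads
            data[i] += 1.0f;
    }

    void launch_add_one(float *d_data, int n) {
        int threads = 256;                        // threads per block
        int blocks = (n + threads - 1) / threads; // enough blocks to cover n elements
        add_one<<<blocks, threads>>>(d_data, n);
    }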
We encourage you to consult the development materials available from NVIDIA, particularly the CUDA
Programming Guide and the Best Practices Guide available at
http://docs.nvidia.com/cuda/index.html

Unit test fixtures


Sometimes when testing your code, you may notice that you are doing very similar operations to set up
certain tests. In such cases, GoogleTest provides test fixtures that enable you to reuse the same object
configuration for multiple tests. For the recurrence problem below, we will employ this approach to
streamline our testing procedures. Additional information on test fixtures can be found in the GoogleTest
documentation:
http://google.github.io/googletest/primer.html#same-data-multiple-tests
Fixture tests must use the macro TEST_F. The first argument of the macro must be the name of the test
fixture class, RecurrenceTestFixture. Using test fixtures, the following sequence of code is executed:

1. A RecurrenceTestFixture object is created.

2. The first test, TEST_F(RecurrenceTestFixture, GPUAllocationTest_1), runs. This test is able to
access objects and subroutines in the test fixture object RecurrenceTestFixture.

3. Once the test completes, the test fixture object is destructed.

This sequence is repeated for all subsequent fixture tests TEST_F.
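For reference, a minimal sketch of what such a fixture looks like (the member names and bodies here are placeholders; the real class in main_q1.cu is where Question 1.1 asks you to add the allocation and deallocation):

    #include <gtest/gtest.h>

    class RecurrenceTestFixture : public ::testing::Test {
    protected:
        RecurrenceTestFixture()           { /* e.g., cudaMalloc the input/output arrays */ }
        ~RecurrenceTestFixture() override { /* e.g., cudaFree the arrays */ }
        float *d_input = nullptr;  // members are visible to every TEST_F using this fixture
        float *d_output = nullptr;
    };

    TEST_F(RecurrenceTestFixture, GPUAllocationTest_1) {
        // The test body can access the fixture's members and helpers.
        SUCCEED();
    }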

Problem 1 Recurrence
The purpose of this problem is to give you experience writing your first simple CUDA program. This
program will help us examine how various factors can affect the achieved performance.
Inspired by the Mandelbrot Set, we want to perform the following recurrence for several values of $c$:

$$z_{n+1} = z_n^2 + c.$$

In general $z$ is a complex number, but for simplicity we will use floats in this homework. For each value of
$c$, you can study the sequence $z_n$. If $z_n$ does not diverge (starting from $z_0 = 0$), then the point $c$ belongs to
the Mandelbrot set. In Figure 1, the coordinates of each pixel correspond to the real and imaginary parts of
$c$. The color of a pixel is determined by computing the smallest iteration $n$ for which $|z_n| > 2$. One can
prove that if $|z_n| > 2$ for some $n$, then $|z_n| \to \infty$ as $n \to \infty$. The recurrence is run for a maximum of
num_iter iterations, and the values of the $c$'s are set in initialize_array() in main_q1.cu.
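To make the computation concrete, here is a minimal host-side sketch of the escape-iteration loop for a single constant $c$ (the provided host_recurrence() plays this role; the function name below is hypothetical):

    #include <cmath>

    // Returns the smallest n with |z_n| > 2, or num_iter if the sequence
    // never escapes. Your GPU kernel applies the same per-c logic per thread.
    int escape_iteration(float c, int num_iter) {
        float z = 0.0f; // z_0 = 0
        for (int n = 1; n <= num_iter; ++n) {
            z = z * z + c;           // z_n = z_{n-1}^2 + c
            if (std::fabs(z) > 2.0f) // |z_n| > 2 implies the sequence diverges
                return n;
        }
        return num_iter; // never escaped within num_iter iterations
    }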

Code
You should be able to take the files we give you and type make main_q1 to build the executable. The
executable will run, but since the CUDA code hasn’t been written yet (that’s your job), it will report errors
and quit. All locations where you need to write code are noted by a TODO in the comments.

Figure 1: Mandelbrot Set. Source: Wikipedia. The black points in the image belong to the Mandelbrot set.
The sequence $z_n$ does not diverge for the corresponding $c$. Although not obvious, the Mandelbrot set is a
connected set.

For this problem we provide the following starter code (* means you should not modify the file):

• main_q1.cu—This is the main file. We have already written most of the code for this assignment so
you can concentrate on the CUDA code. We take care of computing the host solution and checking
your results against the host reference. There is also code to generate the tables you will need to do
the benchmarking questions. You will do questions 1 and 2 in this file.

• recurrence.cuh—This file already contains the necessary function headers—do not change these.
You should fill in the body of the kernel and launch the kernel from doGPURecurrence().

• *test_recurrence.h—This file contains the functions we will use to check your output. Do not
modify this file.

• *Makefile—make main_q1 will build the binary; make clean will remove the executables. You should
be able to build and run the program when you first download it; however, only the host code will run.

• hw2.sh—This script is used to submit jobs to the queue. You need to comment out the other lines in
the file if you only want to run ./main_q1.

Question 1.1
5 points. Allocate GPU memory for the input and output arrays for the recurrence within the test fixture
constructor. Free this GPU memory at the end in the destructor. This code (approx. 4 lines) should be in
the RecurrenceTestFixture class in main_q1.cu.
Question 1.2
5 points. Implement initialize_array(), the function that initializes an array of a given size. The values
are random floats between −1 and 1. These will be the constants 𝑐 in the recurrence. This code should be in
main_q1.cu.

Question 1.3
10 points. Implement the recurrence kernel and launch it. These should be implemented in recurrence()
and doGPURecurrence() respectively in recurrence.cuh. You can see a CPU implementation of the recur-
rence in host_recurrence() and a sample launch of the kernel in main(), both in main_q1.cu. Add the
output of the code (it should be 2 tables) to your PDF submission. The whole run may take 10 minutes.
Question 1.4
10 points. Run the same setup, but set the number of blocks to 72, the number of iterations to 40,000,
and the array size (the number of constants we test) to 1,000,000 (this code has already been written for
you). Vary the number of threads per block over 32, 64, 96, . . . , 1024. Take the table that is generated and plot
the performance in TFlops/sec vs. the number of threads. Comment on and explain the shape of the graph.
Question 1.5
5 points. Run the same setup with the number of threads per block set to 256, the number of blocks set to 576,
and the array size (the number of constants we test) set to 1,000,000 (this code has already been written for you).
Vary the number of iterations as in the code. Take the table that is generated and plot the performance in
TFlops/sec vs. the number of iterations. Comment on and explain the shape of the graph.

Problem 2 Benchmarking with Strided Memory Access


For this problem, we will benchmark our device by performing strided memory accesses. The file
benchmark.cuh performs a benchmark using two very long input arrays $x$ and $y$, as well as an output
array $z$, by computing $z[i] = x[i] + y[i]$ at stride lengths between 1 and 32. That is, $z[i] = x[i] + y[i]$ for
$i \in \{0, 1, 2, 3, \ldots\}$, then $i \in \{0, 2, 4, 6, \ldots\}$, up to $i \in \{0, 32, 64, 96, \ldots\}$.
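To make the access pattern concrete, a hedged sketch of such a kernel is shown below; the actual function headers are fixed in benchmark.cuh and may differ from this hypothetical version.

    __global__ void strided_add(const float *x, const float *y, float *z,
                                int n, int stride) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int i = tid * stride; // each thread touches every stride-th element
        if (i < n)
            z[i] = x[i] + y[i];
    }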
For this problem, we provide the following starter code (* means you should not modify the file):
1. *main_q2.cu—sets up the CUDA runtime and launches your benchmarking kernel with stride lengths
in 1, . . . , 32
2. benchmark.cuh—this is the file you will need to modify and submit. Do not change the function
headers but fill in the bodies and follow the hints/requirements in the comments.
3. *Makefile
$ make main_q2
will build the benchmarking binary. You can run it using sbatch hw2.sh (see item 4 in this list) or
srun -p gpu-turing -G 1 ./main_q2.
$ make clean
will remove the executables. You should be able to build and run the program when you first download
it. However, your results will be incorrect as your kernel won’t be performing any memory accesses.
4. hw2.sh—This script is used to submit jobs to the queue. You need to comment out the other lines in
the file if you only want to run ./main_q2.
Question 2.1
5 points. Implement the strided memory access in benchmark.cuh. Run your code on the cluster. In your
writeup, display the results on a semilogy plot of throughput in GB/s as a function of stride length. Do not
comment on the plot under this part.
Question 2.2
5 points. Comment on and explain the shape of the graph. Why do we observe the trend that we do as the
stride length increases?

Neural Networks on CUDA
The goal is to implement a two-layer neural network that can identify digits from handwritten
images (a specific case of the image classification problem). Neural networks are widely used in machine
learning, particularly in the domains of image processing, computer vision, and natural language
processing. There is a flurry of research on deep learning, which uses more advanced variants
of the simple neural network (NN) we cover here. Being able to train large neural networks
efficiently is therefore important, and it is the goal of this project.

Data and Notation


We will be using the MNIST [1] dataset, which consists of greyscale 28 × 28 pixel images of handwritten
digits from 0 to 9. Some examples of this dataset are shown in Figure 2.

Figure 2: Examples of MNIST digits

The dataset is divided into two parts: 60,000 images in the training set and 10,000 images in the test set.
We will use the training set to optimize the parameters of our neural network (described below), and we
will use the unseen test set to measure the performance of the trained network. We denote the $i$-th image
sample in the training set as $(x^{(i)}, y^{(i)})$, where $x^{(i)}$ denotes the image and $y^{(i)}$ denotes the class label (the
digit shown in the image).

Neural Networks
Neurons
To describe neural networks, we begin by describing the simplest neural network, which comprises a single
“neuron.” Figure 3 illustrates a single neuron.

Figure 3: A single neuron, with inputs $(x_1, x_2, x_3)$ and output $h_{W,b}(x)$.

This neuron is a computational unit that takes as input $(x_1, x_2, x_3)$ and outputs

$$h_{W,b}(x) = f(Wx + b) = f\left(\sum_{i=1}^{3} W_i x_i + b\right)$$

where $f : \mathbb{R} \to \mathbb{R}$ is a non-linear ‘activation’ function, $W$ is the vector of weights of the neuron, and $b$ is the bias of
the neuron. The pair $p = (W, b)$ is referred to as the parameters of the neuron.
For this project, we set $f(\cdot)$ to be the sigmoid function

$$f(z) = \sigma(z) = \frac{1}{1 + \exp(-z)}.$$

The derivative of the sigmoid function with respect to its input is

$$\frac{\partial \sigma(x)}{\partial x} = -\frac{1}{(1 + \exp(-x))^2} \frac{\partial \exp(-x)}{\partial x} = \frac{\exp(-x)}{(1 + \exp(-x))^2} = \sigma(x)(1 - \sigma(x));$$

we will use this fact repeatedly in the following sections. Other common activation functions include
$f(z) = \tanh(z)$ and the rectified linear unit (ReLU) $f(z) = \max(0, z)$. These are illustrated in Figure 4.

Figure 4: Examples of three activation functions: $\tanh(x)$, $1/(1 + \exp(-x))$ (sigmoid), and the rectified
linear unit (ReLU).

A single neuron can be trained to perform the task of binary classification. Consider the example of
cancer detection, where the task is to classify a tumor as benign or malignant. We can provide as input
$x = (\text{size of the tumor}, \text{location of the tumor}, \text{length of time the tumor has existed})$, and if the label is

$$y = \begin{cases} 1 & \text{the tumor is malignant} \\ 0 & \text{the tumor is benign} \end{cases}$$

we can say that the neuron predicts that the tumor is malignant if and only if $f(Wx + b) > 0.5$.
Since the value of $f(Wx + b)$ depends on the sign of $Wx + b$, the neuron effectively partitions the input
space $\mathbb{R}^3$ using a 2-dimensional hyperplane. On one side of the hyperplane $f(Wx + b) > 0.5$, and on the
other $f(Wx + b) < 0.5$. Through an optimization process referred to as ‘training’, we want to find values of
the parameters $W$ and $b$ such that the hyperplane represented by the neuron is as close as possible to the
‘true’ hyperplane.
More generally, we want to find values of the parameters 𝑊 and 𝑏 such that the network’s predictions
are ‘good’ on an unseen test set, since this would imply that our choice of model (here, a neuron with
certain values of $W$ and $b$) is close to the ‘true’ model. It is insufficient to observe good predictions on the
training set, since sufficiently complex networks can be trained to make perfect predictions on the training
set but they perform much worse on unseen data, implying that the trained model is not close to the ‘true’
model.
In the project, we want to perform the task of multi-class classification, rather than binary classification.
Instead of a simple true/false output, we need to decide which digit from 0 to 9 is shown in the input image.

Fully-connected feedforward neural network

Figure 5: Fully Connected Feedforward Neural Network with 2 layers. Layer 0 holds the four inputs
$x_1, \ldots, x_4$, Layer 1 is the hidden layer, and Layer 2 is the output layer.

Figure 5 shows a fully-connected feedforward neural network with 1 input layer, 1 hidden layer, and 1
output layer. We call such a network a two-layer neural network (ignoring the input layer, as it is trivially
present). Let us denote the input as $x \in \mathbb{R}^{1 \times d}$, the number of neurons in layer $i$ as $H_i$, and the parameters of
layer $i$ as $(W^{(i)}, b^{(i)})$.
In our problem, we are trying to determine the digit associated with each image. We will call this digit
the “label” associated with the image. The total number of labels is denoted $C$. In our case $C = 10$, since we
are trying to determine digits 0 to 9.
Recall that the parameters of a single neuron are $W \in \mathbb{R}^{1 \times d}$ and $b \in \mathbb{R}$, i.e., $W$ is a vector of the same
dimensionality as the input and the bias is simply a scalar. Therefore, we can represent the parameters of
layer $i$ of the neural network in matrix form as $W^{(i)} \in \mathbb{R}^{H_i \times H_{i-1}}$ and $b^{(i)} \in \mathbb{R}^{H_i \times 1}$. Similarly, if we had $N$
input vectors, we denote the input collectively as $X \in \mathbb{R}^{N \times d}$.
In Figure 5, $d = 4$, $H_1 = 5$, $H_2 = 3$, $W^{(1)} \in \mathbb{R}^{5 \times 4}$, $b^{(1)} \in \mathbb{R}^{5 \times 1}$, $W^{(2)} \in \mathbb{R}^{3 \times 5}$, $b^{(2)} \in \mathbb{R}^{3 \times 1}$.
The last layer is special. This is the output of our network. In the project, we have 𝐶 = 10 output nodes.
Each node represents a possible digit. We will see later on how the output vector 𝑦ˆ can be interpreted to
determine the digit that is predicted by the network for an input image.

Feed-forward
The nice thing about neural networks is that they are highly modular. Layer 𝐿𝑖 does not need to know
whether its input is the input layer itself or the output of 𝐿𝑖 −1 . 𝐿𝑖 computes its activations as 𝑎 (𝑖 ) =
𝑓 (𝑖 ) (𝑊 𝑎 (𝑖 −1) + 𝑏 (𝑖 ) ), with 𝑎 (0) = 𝑥, where 𝑓 (𝑖 ) is the non-linearity used by 𝐿𝑖 (sigmoid, by default). Feed-
forward is the process of computing the activations of all neurons in the network layer-by-layer, from 𝑖 = 1
to 𝑖 = 2 in our case.

Let us perform the feed-forward for the network in Figure 5:

$$z^{(1)} = W^{(1)} x + b^{(1)}$$
$$a^{(1)} = \sigma(z^{(1)})$$
$$z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}$$
$$\hat{y} = a^{(2)} = \mathrm{softmax}(z^{(2)})$$

$\hat{y}$ is the output of the network. Note that we have represented the linear transformation of layer $L_i$ by $z^{(i)}$.
This will help us in the following sections. The softmax function is defined by:

$$\mathrm{softmax}(z^{(2)})_j \overset{\text{def}}{=} P(\text{label} = j \mid x) \overset{\text{def}}{=} \frac{\exp(z_j^{(2)})}{\sum_{i=1}^{C} \exp(z_i^{(2)})}$$

This equation is saying that the probability that the input has label $j$ (i.e., in our case, that the digit $j$ is
handwritten in the input image) is given by $\mathrm{softmax}(z^{(2)})_j$. Therefore, our predicted label for the input $x$ is
given by:

$$\text{label} = \operatorname{argmax}(\hat{y})$$

This is the digit the network believes is written in the input image.
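As a concrete illustration (the numbers are made up for this example and do not come from the assignment), take $C = 3$ and $z^{(2)} = (1, 2, 3)$:

$$\mathrm{softmax}(z^{(2)}) = \frac{(e^1, e^2, e^3)}{e^1 + e^2 + e^3} \approx \frac{(2.72,\ 7.39,\ 20.09)}{30.19} \approx (0.090,\ 0.245,\ 0.665),$$

so the network assigns the highest probability to the third label, and argmax selects it.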

Training
Recall that our objective is to learn the parameters of the neural network such that it gets the best accuracy
on the test data. Let $y$ be the one-hot vector denoting the class of the input, i.e., $y_c = 1$ if $c$ is the correct
label and 0 otherwise. We want $P(\text{label} = c \mid x)$ to be as high as possible (i.e., close to 1).
Without going into the mathematical details, we will use the following general expression to determine
the error of our neural network. This expression turns out to be the most convenient for our purpose:

$$\mathrm{CE}(y, \hat{y}) = -\sum_{i=0}^{C-1} y_i \log(\hat{y}_i)$$

CE stands for cross-entropy. Since $y$ is a one-hot vector, this simplifies to

$$\mathrm{CE}(y, \hat{y}) = -\log(\hat{y}_c)$$

We can observe that CE is 0 when we have the optimal answer $\hat{y}_c = 1$. Similarly, CE is maximal ($+\infty$) when
$\hat{y}_c$ is 0. This corresponds to a neural network that is “sure” that the digit is not $c$ (maximally wrong).
The total cost for $N$ input data points (such that the cross-entropy of the $i$-th training vector is denoted
$\mathrm{CE}^{(i)}$) is:

$$\text{cost} = J(W, b; x, y) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{CE}^{(i)}(y, \hat{y})$$

The above cost measures the error, i.e., our “dissatisfaction”, with the output of the network. The more
certain the network is about the correct label (high $P(y = c \mid x)$), the lower our cost will be.
Clearly, we should choose the parameters that minimize this cost. This is an optimization problem, and
may be solved using the method of Stochastic Gradient Descent (described below).
Our neural network applies a non-linear function to the input because of the sigmoid and softmax
functions. When optimizing the neural network, we often add a penalization term for the magnitude of $W$
in order to control the non-linearity of the network. If we make $W$ smaller, the network becomes ‘more
linear’ since $Wx \approx \sigma(Wx)$ when $Wx \approx 0$. Despite the possibility of making $W$ too small, and the fact
that there is no rigorous justification for this penalization, it is found to work well in practice. With the
penalization term, the cost function becomes

$$J(W, b; x, y) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{CE}^{(i)}(y, \hat{y}) + \frac{\lambda}{2} \|W\|_2^2 \qquad (1)$$

where $\|W\|_2^2$ is the sum of the squared $\ell_2$-norms of all the weights $W$ of the network, and $\lambda$ is a hyperparameter that
needs to be tuned for best performance. In our implementation, only the weights $W$ are penalized, not the
biases $b$.

Gradient Descent
Gradient Descent is an iterative algorithm for finding local minima of a function. In our case,

$$p \leftarrow p - \alpha \nabla_p J \qquad (2)$$

where $\alpha$ is the learning rate that controls how large the descent step is, and $\nabla_p J$ is the gradient of $J$ with
respect to the network parameters $p$.
In practice, we often do not compute $J = \sum_{i=1}^{N} \mathrm{CE}^{(i)}$ since this requires computing $\mathrm{CE}^{(i)}$ for all $i = 1, \ldots, N$.
Instead, we divide the input into ‘mini-batches’ containing $M$ images and process one mini-batch at a time
until all images are processed. For each mini-batch we calculate $J_{\text{mb}} = \sum_{i=k}^{k+M-1} \mathrm{CE}^{(i)}$ (where $x^{(k)}$ is the first
image in the mini-batch), and update the network parameters $p$ according to the update rule

$$p \leftarrow p - \alpha \nabla_p J_{\text{mb}}. \qquad (3)$$

This algorithm is also called Mini-batch Gradient Descent. See Figure 6 for the pseudo-code, where an
‘epoch’ refers to a single iteration over all $N$ images and corresponds to $\lceil N/M \rceil$ updates to the parameters
$p$. This approach usually leads to faster convergence than Batch Gradient Descent (or simply Gradient
Descent) since we update the network coefficients more than once per epoch.

    epoch = 0
    while epoch < MAX_EPOCHS:
        batches = split(training_samples, M)
        for batch in batches:
            p = p - step * gradient(batch)
        epoch += 1

Figure 6: Mini-batch Gradient Descent

Backpropagation
Backpropagation is the process of updating the neural network coefficients. This involves computing the
gradient of the multi-variable loss function using the chain rule, to obtain ∇𝑝 𝐽 .

Let’s compute the gradient for the parameters in the last layer (layer 2) of our network:

$$\frac{\partial \mathrm{CE}(y, \hat{y})}{\partial z_k^{(2)}} = -\frac{\partial}{\partial z_k^{(2)}} \log \frac{\exp(z_c^{(2)})}{\sum_{i} \exp(z_i^{(2)})} = -\frac{\partial \left[ z_c^{(2)} - \log \sum_{i} \exp(z_i^{(2)}) \right]}{\partial z_k^{(2)}}$$

where the sums run over all $C$ classes. There are two cases here:

1. Case I: $k = c$, i.e., $k$ is the correct label

$$\frac{\partial \mathrm{CE}(y, \hat{y})}{\partial z_k^{(2)}} = -1 + \frac{\exp(z_k^{(2)})}{\sum_{i} \exp(z_i^{(2)})} = -1 + \hat{y}_k = \hat{y}_k - y_k$$

2. Case II: $k \neq c$

$$\frac{\partial \mathrm{CE}(y, \hat{y})}{\partial z_k^{(2)}} = 0 + \frac{\exp(z_k^{(2)})}{\sum_{i} \exp(z_i^{(2)})} = \hat{y}_k - y_k$$

Therefore, the gradient in vector notation simplifies to

$$\frac{\partial \mathrm{CE}(y, \hat{y})}{\partial z^{(2)}} = \hat{y} - y \qquad (4)$$

Recall that $z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}$, such that $z^{(2)} \in \mathbb{R}^{H_2 \times 1}$, $a^{(1)} \in \mathbb{R}^{H_1 \times 1}$, and $W^{(2)} \in \mathbb{R}^{H_2 \times H_1}$. Therefore,

$$\frac{\partial \mathrm{CE}(y, \hat{y})}{\partial W^{(2)}} = \frac{\partial \mathrm{CE}(y, \hat{y})}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial W^{(2)}} = (\hat{y} - y)\,[a^{(1)}]^T \qquad (5)$$

Similarly,

$$\frac{\partial \mathrm{CE}(y, \hat{y})}{\partial b^{(2)}} = \hat{y} - y \qquad (6)$$

Going across $L_2$:

$$\frac{\partial z^{(2)}}{\partial a^{(1)}} = [W^{(2)}]^T$$
$$\frac{\partial \mathrm{CE}(y, \hat{y})}{\partial a^{(1)}} = \frac{\partial z^{(2)}}{\partial a^{(1)}} \frac{\partial \mathrm{CE}(y, \hat{y})}{\partial z^{(2)}} = [W^{(2)}]^T (\hat{y} - y)$$
Going across the non-linearity of $L_1$:

$$\frac{\partial \mathrm{CE}(y, \hat{y})}{\partial z^{(1)}} = \frac{\partial \mathrm{CE}(y, \hat{y})}{\partial a^{(1)}} \circ \frac{\partial \sigma(z^{(1)})}{\partial z^{(1)}} = \frac{\partial \mathrm{CE}(y, \hat{y})}{\partial a^{(1)}} \circ \sigma(z^{(1)}) \circ (1 - \sigma(z^{(1)}))$$

Note that we have assumed that $\sigma(\cdot)$ works on vectors (matrices) by applying an element-wise sigmoid,
and $\circ$ is the element-wise (Hadamard) product.

That brings us to our final gradients:

$$\frac{\partial \mathrm{CE}(y, \hat{y})}{\partial W^{(1)}} = \frac{\partial \mathrm{CE}(y, \hat{y})}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial W^{(1)}} = \frac{\partial \mathrm{CE}(y, \hat{y})}{\partial z^{(1)}}\, x^T \qquad (7)$$

Similarly,

$$\frac{\partial \mathrm{CE}(y, \hat{y})}{\partial b^{(1)}} = \frac{\partial \mathrm{CE}(y, \hat{y})}{\partial z^{(1)}} \qquad (8)$$

The above equations have been derived for a single training vector, but they extend seamlessly to a
matrix of $N$ column vectors. In that case, you need to sum up over all the input images $x$.

Problem 3 Element-wise CUDA kernels


In this homework, you will implement classes to manage memory on the GPU for matrices and GPU kernels
to accelerate element-wise matrix operations that need to be performed in the forward and backward passes
during training. Follow the comments in gpu_func.h and main_q3.cpp to help guide your implementations.
We will use the dense matrix class and related functions from the Armadillo C++ library; please refer to the
documentation at https://arma.sourceforge.net/docs.html.
To allow for easily switching between training in single precision or double precision, we have defined
the type alias nn_real in common.h. You should use this type alias instead of using float or double directly.
We have also defined the macros Log and Exp in common.h, which you should use to compute the natural
logarithm and exponential within your kernels.
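Before diving into the individual kernels, here is a hedged sketch of the generic element-wise kernel pattern these questions build on, written with the nn_real alias and the Exp macro from common.h. The kernel name, wrapper name, and raw-pointer interface below are hypothetical; the actual functions you must implement operate on the DeviceMatrix and MatrixAccessor classes declared in gpu_func.h.

    __global__ void ElemSigmoid(const nn_real *in, nn_real *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x; // one thread per element
        if (i < n)
            out[i] = 1 / (1 + Exp(-in[i])); // sigmoid, computed element-wise
    }

    void launchElemSigmoid(const nn_real *d_in, nn_real *d_out, int n) {
        int threads = 256;                        // threads per block
        int blocks = (n + threads - 1) / threads; // cover all n elements
        ElemSigmoid<<<blocks, threads>>>(d_in, d_out, n);
    }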
Question 3.1
3 points. Implement the operators and functions (4 //TODOs) for the MatrixAccessor class in gpu_func.cu.
Question 3.2
12 points. Implement the constructor, the destructor, the private variable access functions, and the
getAccessor, copyFromHost, copyToHost and deepCopy functions (10 //TODOs) for the DeviceMatrix class
in gpu_func.cu.
Question 3.3
5 points. Implement the kernel MatSigmoid and the kernel wrapper method DSigmoid in gpu_func.cu.
Question 3.4
5 points. Implement the kernel MatRepeatColVec and the kernel wrapper method DRepeatColVec in
gpu_func.cu.

Question 3.5
5 points. Implement the kernel MatSum and the kernel wrapper method DSum in gpu_func.cu.
Question 3.6
5 points. Implement the kernel MatSoftmax and the kernel wrapper method DSoftmax in gpu_func.cu.
Question 3.7
5 points. Implement the kernel MatCrossEntropyLoss and the kernel wrapper method DCELoss in
gpu_func.cu.

Question 3.8
5 points. Implement the kernel MatElemArith and the kernel wrapper method DElemArith in
gpu_func.cu.

Question 3.9
5 points. Implement the kernel MatSquare and the kernel wrapper method DSquare in gpu_func.cu.
Question 3.10
5 points. Implement the kernel MatSigmoidBackProp and the kernel wrapper method DSigmoidBackprop
in gpu_func.cu.

References
[1] Yann LeCun et al. MNIST. http://yann.lecun.com/exdb/mnist/. [Online].

A Submission instructions
To submit:

1. For all questions that require explanations and answers besides source code, put those explanations
and answers in a separate single PDF file. Upload this file on Gradescope.

2. Submit your code by uploading a zip file on Gradescope. Here is the list of files we are expecting:

main_q1.cu
recurrence.cuh
benchmark.cuh
gpu_func.cu
main_q3.cpp

We will not evaluate any code in files not listed above. Make sure to keep all file names as they are.

B Advice and Hints


• For debugging it will be helpful to limit the number of cases being run to 1. In the recurrence problem,
do this by using 1 value instead of the arrays for the 3 for loops.

• If you need some documentation on CUDA, you can look at the documents linked on canvas or visit
the CUDA website at https://docs.nvidia.com/cuda/index.html.

• An easy way to transfer the table output into a plot is to copy the space-separated program output,
paste it into a Google Sheet, highlight the column that contains the data, and click “Data→Split text
to columns” in the top banner, then highlight your new columns and click “Insert→Chart.”
