CME 213, Introduction to parallel computing

Eric Darve
Spring 2025

Homework 5

Total number of points: 100.

Problem 1 Neural Network on the GPU


Similar to how you implemented a class DeviceMatrix to represent matrices on the GPU in Homework 2,
you will implement a class to represent neural networks on the GPU. The purpose of these abstractions is
to make it easier to write and debug code. All information on the neural network architecture can be found
in section C.
In this homework, you will implement a class DeviceNeuralNetwork.

To simplify backward calculations, assume that the neural network has two layers.
Question 1.1
(5 points) Implement the constructor for the class DeviceNeuralNetwork in neural_network.cpp.

Deliverables:

1. Code: The completed class DeviceNeuralNetwork in neural_network.cpp.

Question 1.2
(5 points) Implement the member function DeviceNeuralNetwork::to_cpu.

Deliverables:

1. Code: The completed function DeviceNeuralNetwork::to_cpu in neural_network.cpp.

Question 1.3
(5 points) Implement the constructor for the class GPUGrads.

Deliverables:

1. Code: The completed constructor GPUGrads::GPUGrads in neural_network.cpp.

Problem 2 Forward Pass


Question 2.1
(30 points) Implement the member function DeviceNeuralNetwork::forward in neural_network.cpp. We
will test your implementation by comparing the cached outputs of each layer with the cached outputs of
the CPU implementation.

Deliverables:

1. Code: The completed function DeviceNeuralNetwork::forward in neural_network.cpp.

Problem 3 Loss
Question 3.1
(10 points) Implement the member function DeviceNeuralNetwork::loss in neural_network.cpp. We
will test your implementation by comparing its output with the output of the CPU implementation.

Deliverables:

1. Code: The completed function DeviceNeuralNetwork::loss in neural_network.cpp.

Problem 4 Backward Pass


Question 4.1
(30 points) Implement the member function DeviceNeuralNetwork::backward in neural_network.cpp.
You can use the wrapper function TiledGEMM in gpu_func.cu to perform the GEMM operation

C ← αAB + βC

when either A or B is transposed. You are welcome to use or modify your own GEMM implementation
from Homework 4. Note that only neural_network.cpp will be submitted for grading. We will test your
implementation by comparing the cached gradients with respect to each parameter with the cached gradients
of CPU implementation.

Deliverables:

1. Code: The completed function DeviceNeuralNetwork::backward in neural_network.cpp.

Problem 5 Optimizer Step


Question 5.1
(5 points) Implement the member function DeviceNeuralNetwork::step in neural_network.cpp. We will
test your implementation by comparing the updated parameters with the updated parameters of the CPU
implementation.

Deliverables:

1. Code: The completed function DeviceNeuralNetwork::step in neural_network.cpp.

Problem 6 Profiling with Nsight


Question 6.1
(10 points) Use Nsight to profile a training loop for the neural network that we have provided. After studying the training loop in main_q6.cpp (it is already implemented), profile main_q6 using the commands provided in sbatch [Link]. For reference, the command is:

nsys profile --trace=cuda,nvtx,osrt --output=mainq6_profile --force-overwrite=true main_q6

This will produce a file, mainq6_profile.nsys-rep, which you will need to download and view in Nsight Systems on your local machine.

Study the profile output and identify parts of the training loop that can be changed to improve performance. Look for improvements both in the training loop code that we have provided and in the code you wrote in Problems 1-5 that is used in the loop. Include screenshots of relevant parts of the profiling output in your answer.

Deliverables:

1. Writeup: Comments and analysis of your Nsight output with screenshots to justify.

References
[1] Yann LeCun et al. MNIST. [Link] [Online].

A Running the Code


1. We have provided a script [Link] that compiles the code and runs all tests. You can run it on the
cluster using sbatch [Link].

2. You can compile the code using make, or using

DOUBLE_FLAG=-DUSE_DOUBLE make

to create executables that use double-precision floating-point numbers.

3. You can add additional tests with different size configurations to make sure your code runs correctly,
but we only require neural_network.cpp for submission.

B Submission instructions
1. For all questions that require explanations and answers besides source code, put those explanations
and answers in a separate single PDF file. Upload this file on Gradescope.

2. Submit your code by uploading a zip file on Gradescope. Here is the list of files we are expecting:

neural_network.cpp

We will not evaluate any code in files not listed above. Make sure to keep all file names as they are.

C Neural Networks on CUDA


Neural networks are widely used in machine learning problems, specifically in the domains of image
processing, computer vision, and natural language processing. There is a flurry of research projects on
deep learning, which uses more advanced variants of the simpler neural network we cover here. Therefore,
being able to train neural networks efficiently is important and is the goal of this project.

Figure 1: Examples of MNIST digits.

Data: MNIST
We will be using the MNIST [1] dataset, which consists of 28 × 28 greyscale images of handwritten digits
from 0 to 9. Some examples from this dataset are shown in Figure 1.
The dataset is divided into a training set of 60,000 images and a test set of 10,000 images. We will use the
training set to optimize the parameters of our neural network and we will use the unseen test set to measure
the performance of the trained network. We denote the ith example in the training set by (x^{(i)}, y^{(i)}), where x^{(i)} denotes the image and y^{(i)} denotes the corresponding class label (i.e., the digit shown in the image x^{(i)}).

Model: Neural Networks


Neurons
To describe neural networks we begin by describing the simplest neural network, which comprises a single
neuron.

Figure 2: A single neuron.

The neuron illustrated in Figure 2 is a computational unit that takes as input x = (x_1, x_2, x_3) and outputs

h_{W,b}(x) = f(Wx + b) = f( Σ_{i=1}^{3} W_i x_i + b ),
where f : R → R is some non-linear activation function, W is the weight of the neuron, and b is the bias of
the neuron. The row vector W and the scalar b are referred to as the parameters of the neuron, and the
output of the neuron is referred to as its activation.
In this project, we let f be the sigmoid function given by
f(z) = σ(z) = 1 / (1 + exp(−z)).
The derivative of the sigmoid function with respect to its input is

∂σ(x)/∂x = −(1 / (1 + exp(−x))²) · ∂exp(−x)/∂x = exp(−x) / (1 + exp(−x))² = σ(x)(1 − σ(x));

we will use this fact repeatedly in the following sections. Other common activation functions include
f (z) = tanh(z) and the rectified linear unit (ReLU) f (z) = max(0, z). These are illustrated in Figure 3.


Figure 3: Examples of three activation functions: tanh(x), 1/(1 + exp(−x)) (sigmoid), and the rectified
linear unit (ReLU).

A single neuron can be trained to perform the task of binary classification. Consider the example of cancer detection, where the task is to classify a tumor as benign or malignant. We can provide as input x = (size of tumor, location of tumor, length of time for which the tumor has existed), and if the label is

y = 1 for a malignant tumor and y = 0 for a benign tumor,

we can say that the neuron predicts that the tumor is malignant if and only if f(Wx + b) > 0.5.
Since the value of f (W x + b) depends on the sign of W x + b, the neuron effectively partitions the input
space R3 using a 2-dimensional hyperplane. On one side of the hyperplane we have f (W x + b) > 0.5, and
on the other side of the hyperplane we have f (W x + b) < 0.5. Through an optimization process referred
to as training, we want to find values of the parameters W and b such that the hyperplane represented by
the neuron is as close as possible to the ‘true’ hyperplane.
More generally, we want to find values of the parameters W and b such that the network’s predictions
are ‘good’ on an unseen test set, since this would imply that our choice of model (here, a neuron with
certain values of W and b) is close to the ‘true’ model corresponding to reality.
It is insufficient to observe good predictions on the training set. Sufficiently complex networks can be
trained to make perfect predictions on the training set but they perform much worse on unseen data that
they were not trained on, implying that the trained model is not close to the ‘true’ model.
In this project, we would like to train a neural network to perform multi-class classification rather than
binary classification. Instead of simply predicting true or false, we would like the network we train to be
able to accurately predict which of 10 different digits is shown in the input image.


Figure 4: Fully connected feedforward neural network with two layers.

Fully connected feedforward neural network


Figure 4 shows a fully connected feedforward neural network with an input layer, one hidden layer, and an
output layer. Such a network is referred to as a ‘two-layer fully connected feedforward neural network’ or
a ‘two-layer multilayer perceptron (MLP)’.
The input layer is not counted since a neuron in the input layer performs no computation. For example,
the first neuron in the input layer takes as input x1 and outputs x1 . As this output travels along the edge
connecting the first neuron in the input layer to the first neuron in the hidden layer, it is multiplied by the
weight W1 of the first neuron in the hidden layer. Once it reaches the first neuron in the hidden layer, it
is added to the bias b of the first neuron in the hidden layer, and the result is passed through the sigmoid
function to obtain the activation of the first neuron in the hidden layer.
This process must be repeated for each element of the input vector x ∈ R^{d×1} and for each of the H_1 neurons in layer 1. An efficient way to do this is to use matrix multiplication and compute a^{(1)} = f(W^{(1)} x + b^{(1)}), where W^{(1)} ∈ R^{H_1 × d}, b^{(1)} ∈ R^{H_1 × 1}, and f is the sigmoid function. The element W^{(1)}_{ij} is the jth weight of the ith neuron in layer 1, the element b^{(1)}_i is the bias of the ith neuron in layer 1, and (W^{(1)}, b^{(1)}) are referred to as the parameters of layer 1. The vector a^{(1)} ∈ R^{H_1 × 1} is referred to as the activation of layer 1 and consists of the activations of the neurons in layer 1.

The activation function of the output layer is special. Instead of each neuron independently applying the sigmoid function to its input, all neurons in the output layer collectively compute softmax(W^{(2)} a^{(1)} + b^{(2)}). If C denotes the number of possible class labels, we have C = 10 since there are 10 possible digits 0, . . . , 9. For 1 ≤ i ≤ 10, using the softmax activation function allows us to interpret the ith element of the output vector ŷ ∈ R^{C×1} as the neural network's prediction of the probability that the digit in the input image is digit i − 1.
In general, if H_i is the number of neurons in layer i, then the parameters of layer i are W^{(i)} ∈ R^{H_i × H_{i−1}} and b^{(i)} ∈ R^{H_i × 1}. In Figure 4 we have d = H_0 = 4, H_1 = 5, H_2 = 3, so W^{(1)} ∈ R^{5×4}, b^{(1)} ∈ R^{5×1}, W^{(2)} ∈ R^{3×5}, and b^{(2)} ∈ R^{3×1}. To efficiently process a batch of N inputs x_1, . . . , x_N ∈ R^{d×1}, we can stack them horizontally to obtain a matrix X = [x_1 · · · x_N] ∈ R^{d×N}, and compute a batch of activations

A^{(1)} = f(W^{(1)} X + B^{(1)}) ∈ R^{H_1 × N}, where B^{(1)} = [b^{(1)} · · · b^{(1)}] ∈ R^{H_1 × N}.

Forward pass
The forward pass is the process of computing the activations of all neurons in the network for an input x (or a batch of inputs). For a two-layer MLP, we compute

z^{(1)} = W^{(1)} x + b^{(1)}
a^{(1)} = σ(z^{(1)})
z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}
ŷ = a^{(2)} = softmax(z^{(2)})

The softmax function is defined by:


(2)
(2) def def exp(zj )
softmax(z )j = P (label = j|x) = C
P (2)
exp(zi )
i=1

This equation says that the probability that the input has label j (i.e., in our case, that the digit j is handwritten in the input image) is given by softmax(z^{(2)})_j. Therefore, our predicted label for the input x is given by

label = argmax_j(ŷ_j)

This is the digit the network believes is written in the input image.

Loss
Recall that our objective is to learn the parameters of the neural network such that it achieves the best accuracy on the test set. Let y be the one-hot vector denoting the class of the input: y_c = 1 if c is the correct label and y_i = 0 for all i ≠ c. We want P(label = c | x) to be as high as possible (i.e., close to 1).
Without going into the mathematical details, we will use the following general expression to determine
the error of our neural network. This expression turns out to be the most convenient for our purpose:
CE(y, ŷ) = − Σ_{i=1}^{C} y_i log(ŷ_i)

CE stands for cross-entropy. Since y is a one-hot vector, this simplifies to

CE(y, ŷ) = − log(ŷ_c)

We can observe that CE is 0 when we have the optimal answer ŷ_c = 1. Similarly, CE is maximal (+∞) when ŷ_c is 0. This corresponds to a neural network that is “sure” that the digit is not c (maximally wrong).
The total cost for N input data points (where the cross-entropy of the ith training vector is denoted CE^{(i)}) is:

cost = J(W, b; x, y) = (1/N) Σ_{i=1}^{N} CE^{(i)}(y, ŷ)

The above cost measures the error, i.e. our “dissatisfaction”, with the output of the network. The more certain the network is about the correct label (high P(label = c | x)), the lower our cost will be.
Clearly, we should choose the parameters that minimize this cost. This is an optimization problem, and
may be solved using the method of Stochastic Gradient Descent (described below).

Our neural network applies a non-linear function to the input because of the sigmoid and softmax functions. When optimizing the neural network, we often add a penalization term for the magnitude of W in order to control the non-linearity of the network. If we make W smaller, the network becomes ‘more linear’, since σ is approximately linear near 0 and so σ(Wx) is close to an affine function of Wx when Wx ≈ 0. Although W can be made too small, and there is no rigorous justification for this penalization, it is found to work well in practice. With the penalization term, the cost function becomes

J(W, b; x, y) = (1/N) Σ_{i=1}^{N} CE^{(i)}(y, ŷ) + (λ/2) ∥W∥₂²    (1)

where ∥W∥₂² is the sum of the squared l2-norms of all the weight matrices W of the network, and λ is a hyperparameter that needs to be tuned for best performance. In our implementation, only the weights W are penalized, not the biases b.

Backward Pass
The backward pass is the process of using the chain rule to compute ∇p J, the gradient of the loss function
with respect to each parameter of the neural network. This process is also referred to as backpropagation,
since gradients are ‘propagated backward’ through the network using the chain rule.
Let’s compute the gradient for the parameters in the last layer (layer 2) of our network:

∂CE(y, ŷ)/∂z^{(2)}_k = −∂/∂z^{(2)}_k log( exp(z^{(2)}_c) / Σ_{i=1}^{C} exp(z^{(2)}_i) ) = −∂/∂z^{(2)}_k ( z^{(2)}_c − log Σ_{i=1}^{C} exp(z^{(2)}_i) )

There are two cases here:


1. Case I: k = c, i.e., k is the correct label:

∂CE(y, ŷ)/∂z^{(2)}_k = −1 + exp(z^{(2)}_k) / Σ_{i=1}^{C} exp(z^{(2)}_i) = −1 + ŷ_k = ŷ_k − y_k

2. Case II: k ≠ c:

∂CE(y, ŷ)/∂z^{(2)}_k = 0 + exp(z^{(2)}_k) / Σ_{i=1}^{C} exp(z^{(2)}_i) = ŷ_k − y_k

Therefore, the gradient in vector notation simplifies to

∂CE(y, ŷ)/∂z^{(2)} = ŷ − y    (2)
Recall that z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}, with z^{(2)} ∈ R^{H_2×1}, a^{(1)} ∈ R^{H_1×1}, and W^{(2)} ∈ R^{H_2×H_1}. Therefore,

∂CE(y, ŷ)/∂W^{(2)} = (∂CE(y, ŷ)/∂z^{(2)}) (∂z^{(2)}/∂W^{(2)}) = (ŷ − y) [a^{(1)}]^T    (3)

Similarly,

∂CE(y, ŷ)/∂b^{(2)} = ŷ − y    (4)
Going across L2:

∂z^{(2)}/∂a^{(1)} = [W^{(2)}]^T

∂CE(y, ŷ)/∂a^{(1)} = (∂CE(y, ŷ)/∂z^{(2)}) (∂z^{(2)}/∂a^{(1)}) = [W^{(2)}]^T (ŷ − y)
Going across the non-linearity of L1:

∂CE(y, ŷ)/∂z^{(1)} = (∂CE(y, ŷ)/∂a^{(1)}) (∂σ(z^{(1)})/∂z^{(1)}) = (∂CE(y, ŷ)/∂a^{(1)}) ∘ σ(z^{(1)}) ∘ (1 − σ(z^{(1)}))
Note that we have assumed that σ(·) works on vectors (matrices) by applying an element-wise sigmoid,
and ◦ is the element-wise (Hadamard) product.
That brings us to our final gradients:

∂CE(y, ŷ)/∂W^{(1)} = (∂CE(y, ŷ)/∂z^{(1)}) (∂z^{(1)}/∂W^{(1)}) = (∂CE(y, ŷ)/∂z^{(1)}) x^T    (5)

Similarly,

∂CE(y, ŷ)/∂b^{(1)} = ∂CE(y, ŷ)/∂z^{(1)}    (6)
The above equations have been derived for a single training vector, but they extend seamlessly to a matrix of N column vectors. In that case, the weight and bias gradients are summed over all input images x in the batch.

Gradient Descent
Gradient Descent is an iterative algorithm for finding local minima of a function parameterized by some
parameters p. The gradient descent update rule is

p ← p − α ∇p J (7)

where α is the learning rate that controls how large the descent step is. ∇p J is the gradient of J with
respect to the network parameters p.
In practice, we often do not compute J = Σ_{i=1}^{N} CE^{(i)}, since this requires computing CE^{(i)} for all i = 1, . . . , N. Instead, we divide the input into ‘mini-batches’ containing M images and process one mini-batch at a time until all images are processed. For each mini-batch we calculate

J_mb = Σ_{i=k}^{k+M−1} CE^{(i)}

(where x^{(k)} is the first image in the mini-batch), and update the network parameters p according to the update rule

p ← p − α ∇_p J_mb.    (8)

This algorithm is also referred to as Mini-batch Gradient Descent. See below for the pseudo-code, where an ‘epoch’ refers to a single iteration over all N images and corresponds to ⌈N/M⌉ updates to the parameters p. This approach usually leads to faster convergence than Batch Gradient Descent (or simply Gradient Descent) since we update the network coefficients more than once per epoch.

Algorithm 1 Mini-batch Gradient Descent


epoch ← 0
while epoch < MAX_EPOCHS do
batches ← split(training_samples, M )
for batch in batches do
p ← p − step × gradient(batch)
end for
epoch ← epoch + 1
end while
