Proj Overview
Fall 2023
Project Overview:
Matrix-Sparse Vector Mult. Hardware
Issued: 9/25/23, Due: 12/8/23, 11:59 PM
This project specification is contained in six documents. This document contains the overview of
the project. (You should start here.) Please see the accompanying PDFs for the detailed specifications
and tasks for each of the five parts (“Part 1” through “Part 5”) of the project.
1. Introduction
In this project, you will design, implement, simulate, and synthesize a hardware system for
performing matrix-vector multiplication, where the matrix is dense, and the vector is sparse. This
is called “matrix-sparse vector multiplication,” which we will abbreviate as MSpVM. You will
turn in:
- your documented and commented code
- clearly labeled synthesis reports
- a report answering all questions and including requested information
I will run additional simulations on the code you turn in, so it is very important to:
1. Make sure your designs simulate correctly using QuestaSim on the lab computers or the
CAD servers.
2. Carefully organize your code as specified in this document.
3. Make sure the names and behavior of all signals match this specification exactly.
4. Carefully label and document your code.
Your project will be evaluated on correctness and efficiency of your design, the quality of your
report, and your answers to questions in the report.
You may work alone or with one partner on this project. You may not share code with others
(except your partner). This means you may not allow others to see your code, nor may you
read others’ code (for this or related projects). All code will be run through an automatic
code comparison tool. Plagiarism will result in a score of zero on the assignment for all
involved parties. If you have questions as to what is acceptable, please come to office hours or
send Prof. Milder email to ask for clarification.
If you have general questions about the project, please post them on Piazza.
File Organization
This project is broken into five parts. To make it possible for me to grade and understand your
work, please carefully organize your files. Use a separate sub-directory for each part (called
part1/ part2/ part3/ part4/ and part5/). Then be sure to name your files and modules as
specified in the description below. Make sure all your files are stored inside of a private work
directory like the ese507work directory you made in the HW2 Tool Tutorial.
Point Breakdown
1. Part 1: Multiply-Accumulate Unit [15 points]
2. Part 2: Output FIFO [15 points]
3. Part 3: Input Memory Module [20 points]
4. Part 4: Matrix-Sparse Vector Multiplier (MSpVM) [25 points]
5. Part 5: Throughput Optimization [20 points]
6. Quality of report, code, comments, and organization [5 points]
Getting Started
As you can see, this project is large and complex. This document provides a high-level overview
and some important background information. Then, the accompanying documents give the
specification and tasks for each of the five parts of the project. Begin by carefully reading this
overview, and then you should spend some time looking through Parts 1–5. Then when you are
ready to start working, see the Part 1 document.
2. Partner
You may work alone or in a team of two students for this project. If you choose to work with a
partner, it is important that both partners contribute fully to all phases of the project. Your report
will require you to describe each partner’s contribution to the project. Unequal contributions may
be reflected in scoring.
If you are choosing to work with a partner, by Monday 10/2 at 11:59pm you must:
• Send an email to [email protected] with the subject “ESE 507 Project Partner
Signup”
• Send the email from your @stonybrook.edu email address
• In the body of the email, write both your name and your partner’s name
• CC your partner on the email (using your partner’s @stonybrook.edu email address)
After this, you are committed to work with this partner on this project for the entire semester.
3. Background
3.1 Matrix-Vector Multiplication
We first begin by reviewing matrix-vector multiplication. As an example, let W represent a square
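3×3 matrix, and let x represent a (column) vector of length 3. The product y = W x is defined as:
\[
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \end{bmatrix}
=
\begin{bmatrix}
w_{0,0} & w_{0,1} & w_{0,2} \\
w_{1,0} & w_{1,1} & w_{1,2} \\
w_{2,0} & w_{2,1} & w_{2,2}
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix}
w_{0,0} x_0 + w_{0,1} x_1 + w_{0,2} x_2 \\
w_{1,0} x_0 + w_{1,1} x_1 + w_{1,2} x_2 \\
w_{2,0} x_0 + w_{2,1} x_1 + w_{2,2} x_2
\end{bmatrix}
\]
So, this system takes in 12 values (the 3×3 matrix W and the 3×1 column vector x) and produces 3
values (the 3×1 column vector y).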
We can also use array notation and represent this operation as computing (for m = 0,1,2):
\[
y[m] = \sum_{n=0}^{2} W[m][n] \cdot x[n]
\]
So, each output value y[m] is computed by multiplying and adding the appropriate values of the
matrix W and input vector x.
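In general, for an M×N matrix W and a length-N vector x, the product y = W x is:
\[
\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_{M-1} \end{bmatrix}
=
\begin{bmatrix}
W_{0,0} & W_{0,1} & \cdots & W_{0,N-1} \\
W_{1,0} & W_{1,1} & \cdots & W_{1,N-1} \\
\vdots & \vdots & \ddots & \vdots \\
W_{M-1,0} & W_{M-1,1} & \cdots & W_{M-1,N-1}
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{N-1} \end{bmatrix}
\]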
Or, in pseudocode:
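A minimal Python sketch of this computation, intended only as a software reference model and not as a hardware description. The function name `dense_matvec` and the list-of-rows layout for W are assumptions made for illustration.

```python
def dense_matvec(W, x):
    """Compute y = W x for an M x N matrix W (a list of M rows) and a length-N vector x."""
    M = len(W)
    N = len(x)
    y = [0] * M
    for m in range(M):
        acc = 0
        for n in range(N):
            # One multiplication and one accumulation per matrix entry.
            acc += W[m][n] * x[n]
        y[m] = acc
    return y


# Small 2x2 check: [1*5 + 2*6, 3*5 + 4*6]
print(dense_matvec([[1, 2], [3, 4]], [5, 6]))  # [17, 39]
```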
Computing each of the M values in y requires performing N multiplications and summing up their
results. In total, this requires MN multiplications and M(N–1) additions.
In this project, M and N will be parameters of your hardware system. That is, you will design a
system in SystemVerilog that has parameters M and N which can be changed in the code.
We will use D to denote the number of non-zero entries in vector x. (Necessarily, 1 ≤ D ≤ N.) Here
is an example of MSpVM of a dense 3x4 matrix with a sparse vector that has two non-zero entries.
(That is, M=3, N=4, and D=2.)
\[
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \end{bmatrix}
=
\begin{bmatrix}
1 & 2 & 3 & 4 \\
5 & 6 & 7 & 8 \\
9 & 10 & 11 & 12
\end{bmatrix}
\begin{bmatrix} 4 \\ 0 \\ 0 \\ 3 \end{bmatrix}
=
\begin{bmatrix}
1 \cdot 4 + 4 \cdot 3 \\
5 \cdot 4 + 8 \cdot 3 \\
9 \cdot 4 + 12 \cdot 3
\end{bmatrix}
=
\begin{bmatrix} 16 \\ 44 \\ 72 \end{bmatrix}
\]
Notice how we can skip computations related to the entries of x that are equal to 0 since they cannot
contribute to the result.
When working with sparsity, we use a sparse encoding to represent sparse data. This will allow us
to compactly represent the non-zero parts of the vector without storing all of the 0s, and it
will allow us to build hardware that only performs arithmetic on the non-zero portions of the data.
To do so, we will use a simple format based on Compressed Sparse Column (CSC) encoding [1] to
represent our sparse input vector x. In this encoding, only the non-zero entries of the vector are
stored, but alongside each value, we must also store the row it belongs to. For example, we
would store the value of x in the example above as val = [4, 3] and row = [0, 3]. So, this
tells us that the value 4 is in row 0, the value 3 is in row 3, and all other values are 0.
As another example, if N = 10, and x is represented by val = [1, 2] and row = [5, 9], then this
corresponds to a column vector with values [0, 0, 0, 0, 0, 1, 0, 0, 0, 2] (and D=2).
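The encoding can be illustrated with a short Python sketch. The helper names `encode_sparse` and `decode_sparse` are hypothetical and chosen only for this example; they are not part of the project specification.

```python
def encode_sparse(x):
    """Return (val, row): the non-zero entries of x and the row index of each."""
    val = [v for v in x if v != 0]
    row = [i for i, v in enumerate(x) if v != 0]
    return val, row


def decode_sparse(val, row, N):
    """Rebuild the length-N column vector from its (val, row) encoding."""
    x = [0] * N
    for v, r in zip(val, row):
        x[r] = v
    return x


print(encode_sparse([4, 0, 0, 3]))        # ([4, 3], [0, 3])
print(decode_sparse([1, 2], [5, 9], 10))  # [0, 0, 0, 0, 0, 1, 0, 0, 0, 2]
```

The number of non-zero entries D is simply the length of `val` (and of `row`).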
Compressing our sparse vector in this way obviously can make it smaller (if D is small), but it has
another benefit: it allows hardware or software to perform computations while skipping the 0
entries. In pseudocode (where the matrix has M rows and N columns, and the vector, which has D
non-zero entries, is encoded in the sparse format described above):
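A minimal Python sketch of the sparse computation, given as a software reference model under stated assumptions: the function name `mspvm` is hypothetical, W is a list of M rows, and `val` and `row` are the two arrays of the sparse encoding.

```python
def mspvm(W, val, row):
    """Compute y = W x, where x is given by its non-zero values (val) and their rows (row)."""
    M = len(W)
    D = len(val)
    y = [0] * M
    for m in range(M):
        acc = 0
        for d in range(D):
            # Only the matrix column selected by row[d] is touched;
            # columns that would be multiplied by 0 are skipped entirely.
            acc += W[m][row[d]] * val[d]
        y[m] = acc
    return y


W = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
print(mspvm(W, val=[4, 3], row=[0, 3]))  # [16, 44, 72]
```

The inner loop runs D times per output instead of N times, which is where the savings come from.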
To make sure you understand this pseudocode, it is useful to work out the 3x4 example given above
(where val = [4, 3] and row = [0, 3]).
Recall from above that a dense matrix-vector multiplication requires MN multiplications and
M(N–1) additions. Now, we can see that with a sparse vector with D non-zero entries, MSpVM requires
only MD multiplications and M(D–1) additions. If D is much smaller than N, this is a large
reduction in the number of computations to perform. (E.g., if N = 1000 and D = 10, you have
eliminated 99% of the computation.)
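The operation counts can be checked with a few lines of Python. The function name `op_counts` and the choice of M = 3 in the example are assumptions for illustration only.

```python
def op_counts(M, N, D):
    """Return (dense, sparse) operation counts for an M x N matrix and D non-zero entries."""
    dense = {"mults": M * N, "adds": M * (N - 1)}
    sparse = {"mults": M * D, "adds": M * (D - 1)}
    return dense, sparse


dense, sparse = op_counts(M=3, N=1000, D=10)
saved = 1 - sparse["mults"] / dense["mults"]
print(dense)   # {'mults': 3000, 'adds': 2997}
print(sparse)  # {'mults': 30, 'adds': 27}
print(f"{saved:.0%} of multiplications eliminated")  # 99% of multiplications eliminated
```

The fraction of multiplications saved is 1 − D/N, so it does not depend on M.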
Your goal in this project is to build a hardware system that computes the product of a dense MxN
matrix with a sparse vector. Your SystemVerilog code will be parameterized (using SystemVerilog
parameters) to allow the values of M and N to be easily changed in the code. The value of D will
vary based on the vector you give your system as input. The following section introduces the
specified structure of the design, its parameters, and the protocols it uses for input and output. Your
system will take in a stream of values that represent a matrix and a sparse vector, compute the
MSpVM, and output the result vector. Then your system will take in new inputs and repeat the
process.
[1] When applied to matrices (instead of vectors), the CSC encoding is slightly more complex than this;
in this project we simplify the representation because only our vector is sparse.
Figure 2 illustrates a high-level block diagram of the system you will construct. Each of the
components will be specified and described in more detail in the following sections.
• Your system will take as input a stream of data in AXI-Stream format that represents a
matrix and a sparse input vector. (The sparse vector is encoded as described in Section 3
above.) Your system will perform a MSpVM of these and produce the result as output. The
system’s output values will be provided in AXI-Stream format. (The AXI-Stream protocol
and its use are described below in Section 5.) After completing a MSpVM, your system
will accept a new set of inputs to compute. (In other words, your system will keep
computing matrix-sparse vector products as long as new inputs are provided.)
• The core arithmetic of the system is performed by a multiply-accumulate (MAC) unit, which
computes:
f += a*b
Take note of how this is the fundamental operation used in the matrix-vector product
pseudocode described above. The MAC unit is Part 1 of the project, and it is described in
the Project Part 1 document.
• As the MAC unit computes values of the output vector, it places them in the Output FIFO
module, which is Part 2 of the project. The Output FIFO module will buffer the values and
output them from your system in AXI Stream format. This module is described in the
Project Part 2 document.
• Your system will also require input memories to store the matrix and vector values while
the system performs the computation. These are stored in the Input Memory module, which
is Part 3 of the project. This module will include a memory for the matrix, a memory for
the vector, and necessary control logic. You can read more about this in the Project Part 3
document.
• The goal of Part 4 of the project will be to integrate the three components from Parts 1–3
and design accompanying control logic that will allow the components to work together to
perform the full matrix-sparse-vector product. The control logic will be responsible for
coordinating the operation of the input memories, MAC unit, and output memories. You
can read more about Part 4 in the Project Part 4 document.
• Lastly, the goal of Part 5 of the project will be to optimize the speed of the system. Please
see the Project Part 5 document.
Parameters
Rather than building hardware for a specific matrix size, you will design a parameterized system
to allow flexibility in the matrix/vector dimensions and in the number of bits used for input and
output values. This means that it will use the following SystemVerilog parameters:
• M: the number of rows of the matrix and rows of the output vector. Your system must
support M ≥ 2.
• N: the number of columns of the matrix and rows of the input vector. Your system must
support N ≥ 2.
• INW: the input bit width (the number of bits used per value in the input matrix and input
vector). Your system must support 2 ≤ INW < 32 bits.
• OUTW: the output bit width (the number of bits used per value in the output vector). Your
system must support 4 ≤ OUTW ≤ 64.
  - OUTW must also be large enough to prevent overflow. Please see the explanation in the
    Project Part 1 document.
There is no defined upper bound on M and N, but as they get larger, the simulation and
synthesis time will grow.
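When choosing test configurations, it can help to confirm that they fall inside the ranges the system is required to support. The following Python sketch does only that; the function name `check_params` is hypothetical, and it does not check the overflow requirement on OUTW, which is defined in the Project Part 1 document.

```python
def check_params(M, N, INW, OUTW):
    """Return a list of messages for values outside the ranges listed above."""
    errors = []
    if M < 2:
        errors.append("M must be at least 2")
    if N < 2:
        errors.append("N must be at least 2")
    if not (2 <= INW < 32):
        errors.append("INW must satisfy 2 <= INW < 32")
    if not (4 <= OUTW <= 64):
        errors.append("OUTW must satisfy 4 <= OUTW <= 64")
    return errors


print(check_params(M=3, N=4, INW=8, OUTW=20))  # []
print(check_params(M=1, N=4, INW=32, OUTW=2))
```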
AXI-Stream is shown in its simplest form in Figure 3. It is a synchronous protocol (meaning both
sides share a common clock) that allows a transmitter [2] module to transfer data to a receiver when
both sides “agree.”
Figure 3. Simplified AXI-Stream protocol signals
The transmitter asserts the TVALID signal when it has placed valid data on the TDATA signal. The
receiver asserts the TREADY signal when it is capable of consuming that data. On any positive
clock edge, data is transferred if (and only if) both the TVALID and the TREADY signals are asserted.
(No data will ever be transferred unless both are asserted.) TVALID and TREADY are 1-bit signals,
while TDATA is multiple bits.
[2] In earlier versions of the AXI standard, the transmitter was called a “master,” and the receiver
was called a “slave.” This terminology was changed to “transmitter” and “receiver” in ARM’s 2021
standard, although you will still see the older terms used in some places and CAD tools.
Note that the transmitter and receiver modules must share a common clock. We will call this
collection of signals (TVALID, TREADY, and TDATA) an “AXI-Stream interface.” Figure 4 and Table
1 illustrate this functionality and timing. In this example d[0], d[1], etc., represent the data words
transmitted.
[Figure 4: timing diagram of clock, TVALID, and TREADY over clock cycles 1 through 9.]
The interaction of the TVALID and TREADY signals is called a handshake. Think of asserting
TVALID as the transmitter holding out a hand; think of asserting TREADY as the receiver holding
out a hand. If both sides hold out their hand, then they shake hands and agree that a data transfer is
complete.
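The handshake rule can be modeled in a few lines of Python. This is a cycle-by-cycle sketch, not a hardware description; the function name `transfers` and the example waveform are made up for illustration and do not reproduce Figure 4.

```python
def transfers(tvalid, tready, tdata):
    """Return (cycle, data) for each positive clock edge where TVALID and TREADY are both 1."""
    completed = []
    for cycle, (v, r, d) in enumerate(zip(tvalid, tready, tdata), start=1):
        if v and r:
            completed.append((cycle, d))
    return completed


# One list entry per positive clock edge (cycles 1 through 6).
tvalid = [0, 1, 1, 1, 0, 1]
tready = [1, 0, 1, 1, 1, 0]
tdata = [None, "d0", "d0", "d1", None, "d2"]
print(transfers(tvalid, tready, tdata))  # [(3, 'd0'), (4, 'd1')]
```

In cycle 2 the transmitter is valid but the receiver is not ready, so d0 is not transferred until cycle 3. In cycle 6 the receiver is not ready, so d2 is not transferred at all within this window.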
In our simplified AXI-Stream protocol, the transmitter is not permitted to wait until TREADY is
asserted before asserting TVALID, and the receiver is not permitted to wait until TVALID is asserted
before asserting TREADY. [3] In other words, both the transmitter and the receiver need to decide
independently whether to assert their signal. (Then at the positive clock edge, they each check to
see if the other side has asserted theirs.)
Figure 5. AXI-Stream signals including TUSER and TLAST.
• TUSER is a multi-bit signal that transmits “sideband data” from the transmitter to the
receiver. Think of this as extra information that we transmit alongside of TDATA. This
signal is controlled in exactly the same way as TDATA: Anytime TREADY and TVALID are
1 on a positive clock edge, then the information on TUSER is also transmitted.
• TLAST is a 1-bit signal that the transmitter can use to indicate that the currently transmitted
data is the end of a transfer. Like TDATA and TUSER, this signal will be ignored except
when TREADY and TVALID are asserted on a positive clock edge.
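A receiver-side view of these two signals can be sketched in Python. The function name `collect_transfers` and the dictionary layout are assumptions for illustration; TUSER is treated as opaque sideband data here, since its meaning in this project is defined in the Project Part 3 document.

```python
def collect_transfers(beats):
    """Group handshaked (TDATA, TUSER) pairs into transfers, each ended by TLAST = 1.

    beats holds one dict per positive clock edge with keys
    tvalid, tready, tdata, tuser, and tlast.
    """
    packets, current = [], []
    for b in beats:
        # TDATA, TUSER, and TLAST only count when the handshake occurs.
        if b["tvalid"] and b["tready"]:
            current.append((b["tdata"], b["tuser"]))
            if b["tlast"]:
                packets.append(current)
                current = []
    return packets


beats = [
    dict(tvalid=1, tready=1, tdata=10, tuser=0, tlast=0),
    dict(tvalid=1, tready=0, tdata=11, tuser=1, tlast=1),  # no handshake: ignored
    dict(tvalid=1, tready=1, tdata=11, tuser=1, tlast=1),  # ends the first transfer
    dict(tvalid=0, tready=1, tdata=99, tuser=0, tlast=1),  # TVALID low: ignored
    dict(tvalid=1, tready=1, tdata=12, tuser=2, tlast=1),  # a one-word transfer
]
print(collect_transfers(beats))  # [[(10, 0), (11, 1)], [(12, 2)]]
```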
Output Stream
Your system will utilize TDATA, TREADY, and TVALID for its output. (For output data, you will not
use TUSER or TLAST.) Your Output FIFO module (Part 2) will serve as the transmitter. When you
simulate, the testbench will be the receiver for this output data. (In a real system, the receiver would
be whatever component your system connects to.) So, your Output FIFO will have outputs TDATA
and TVALID and input TREADY. The testbench will have inputs TDATA and TVALID and output
TREADY.
[3] One small difference between the full AXI-Stream protocol and our simplified version is that in the
complete AXI-Stream Protocol, the receiver may choose to wait until TVALID is asserted before
asserting TREADY, although we will not allow it in our project.
Another difference between our simplified AXI-Stream and the full protocol is that in the full protocol,
once the transmitter asserts TVALID, it must keep it asserted until the handshake occurs; we will allow
it to be de-asserted at any time.
Input Stream
Your system’s input interface will use all five signals (including TUSER and TLAST). Your Input
Memory module (Part 3) will serve as the receiver. When you simulate, the testbench will be the
transmitter for this input data. For details about how the input data are provided in this stream, and
how TUSER and TLAST are used to transmit sparse vectors, please see the Project Part 3 document.
1. Code
For your code and synthesis reports, you will turn in a single .zip, .tar, or .tgz file to
Brightspace. Do not use a different archive format (e.g., .rar). Seriously, please do not
use any archive format except .zip, .tar, or .tgz or you will lose points.
This compressed file should hold all of the code and synthesis reports from your project,
organized into part1/ through part5/ directories. I will be testing your designs using
my testbenches, so it is very important that your code sticks to the specification closely. I
will test your designs using the ECE grad lab computers so make sure everything runs
correctly there.
Do not turn in things like QuestaSim “work” directories or gate-level Verilog produced by
synthesis. Please only submit your actual code.
2. Synthesis Reports
Include the DesignCompiler synthesis report (in plaintext format) for each design you
synthesized. These should be included in the .zip, .tar, or .tgz archive file mentioned above.
Make sure these reports are clearly labeled. Please include them in the appropriate part1/
part2/ part3/ part4/ or part5/ directory.
3. Report
Please organize your report neatly. Use headings to separate it into Part 1, Part 2, Part 3,
Part 4, and Part 5. Each part of the project will have a numbered list of questions you should
answer. In your report, please use the same numbering to make it easier to find your
answers. (In other words, number your answers to match the questions in this assignment.)
In addition to the code submission, your report should be submitted (as a PDF file only)
alongside your .zip, .tar, or .tgz archive. (Please include the PDF report separately from
the archive.) If you worked with a partner, make sure you answered the questions in
each Part where you are asked to explain each partner’s contribution to the project.
(If you worked alone obviously you can skip this.)
To create a .tgz file in Linux, first assemble a hand-in directory with copies of all of your
code, etc. For this example, let’s assume that directory is called handin/. Now, assuming
you are one directory above handin/, type the following:
tar czvf myhandin.tgz handin/
This will create a gzipped-tar file (.tgz) that contains the entire handin/ directory
(including all of its contents).
You can test that it worked properly by copying the .tgz file you created to another
directory, and typing:
tar xvzf myhandin.tgz
This will extract the file into the directory you are currently in. If you have any problems
with this or anything else, please post them on Piazza.
Please, only use .zip, .tar, or .tgz files for your archive, and use PDF for your report. If you use
other formats, I will be unable to open your work on the lab computers, and you will lose points.
Your code archive should only contain your code and your synthesis reports with clearly labeled
names. Please do not submit the testbenches that were provided to you or other things like
QuestaSim “work” directories or gate-level Verilog produced by synthesis.