Online Architecture Assignment Help
To serve a memory request, the memory controller issues one or more DRAM
commands to access data from a bank. There are four different DRAM commands,
as discussed in class.
• ACTIVATE: Loads the row (that needs to be accessed) into the bank’s row-buffer.
This is called opening a row. (Latency: 15ns)
• PRECHARGE: Prepares the bank for the next access by closing the row in the bank
(and making the row buffer empty). (Latency: 15ns)
• READ/WRITE: Accesses data from the row-buffer. (Latency: 15ns)
The diagrams below show snapshots of the memory controller’s request queues at time
0, i.e., t0, when applications A, B, and C are executed together on a multi-core
processor. Each application runs on a separate core but shares the memory subsystem
with other applications. Each request is color-coded to denote the application to which
it belongs. Additionally, each request is annotated with a number that shows the order of
the request among the set of enqueued requests of the application to which it belongs.
For example, A3 means that this is the third request from application A enqueued in the
request queue. Assume all memory requests are reads and a read request is considered to
be served when the READ command is complete (i.e., 15 ns after the request’s READ
command is issued).
• The memory system has two DRAM channels, one DRAM bank per channel, and four
rows per bank.
• All the row-buffers are closed (i.e., empty) at time 0.
• All applications start to stall at time 0 because of memory.
• No additional requests from any of the applications arrive at the memory controller.
• An application (A, B, or C) is considered to be stalled until all of its memory requests
(across all the request buffers) have been served.
Problem Specification
The table below shows the stall times of applications A, B, and C under the FCFS
(First-Come, First-Served) and FR-FCFS (First-Ready, First-Come, First-Served)
scheduling policies.
The diagrams below show the scheduling order of requests for Channel 0 and Channel
1 with the FCFS and FR-FCFS scheduling policies.
What are the numbers of row hits and row misses for each DRAM bank under each of
the two scheduling policies? Show your work.
To calculate the numbers of hits and misses, we should consider the following facts:
• The first request in each channel is always a row-buffer miss; it requires
one ACTIVATE and one READ command, leading to a 30 ns delay.
• For all requests in each channel except the first one, a row-buffer miss requires
one PRECHARGE, one ACTIVATE, and one READ command, leading to a 45 ns
delay.
• A row-buffer hit requires only one READ command, leading to a 15 ns delay.
• The stall time of each application is the maximum of its requests’ service
completion times across Channel 0 and Channel 1.
• When using the FR-FCFS policy, requests are reordered to exploit row-buffer
locality. For example, the B1 request in Channel 0 is reordered under FR-FCFS with
respect to FCFS and is executed after the C1, A3, B2, and B4 requests; this means
that A2, C1, A3, B2, and B4 all access the same row. (A small timing sketch follows
this list.)
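As a sanity check, here is a minimal timing sketch of these rules in Python. The hit/miss pattern passed in is a hypothetical example, not the actual request schedule from the figures above.

T_ACTIVATE = T_PRECHARGE = T_READ = 15   # ns, per the command latencies above

def service_times(pattern):
    """pattern: 'hit'/'miss' per request, in the order a bank serves them.
    Returns the time (ns) at which each request's READ completes."""
    t = 0
    times = []
    for i, kind in enumerate(pattern):
        if i == 0:              # row buffer closed at t0: ACTIVATE + READ
            t += T_ACTIVATE + T_READ
        elif kind == 'miss':    # row conflict: PRECHARGE + ACTIVATE + READ
            t += T_PRECHARGE + T_ACTIVATE + T_READ
        else:                   # row hit: READ only
            t += T_READ
        times.append(t)
    return times

print(service_times(['miss', 'hit', 'hit', 'miss']))   # -> [30, 45, 60, 105]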
2 Branch Prediction
(a) A program with 1000 dynamic instructions completes in 2211 cycles. If 200 of those
instructions are conditional branches, at the end of which pipeline stage are the branch
instructions resolved? (Assume that the pipeline does not stall for any reason other
than the conditional branches (e.g., there are no data-dependency stalls) during the
execution of that program.)
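One consistent derivation (a sketch: the pipeline depth D and the resolution stage S are inferred from the cycle count rather than given). With one instruction completing per cycle, the total cycle count is the instruction count plus the pipeline fill time plus the stall cycles of the 200 conditional branches, where a branch resolved at the end of stage S costs S − 1 stall cycles:

% Sketch for (a); D (pipeline depth) and S (resolution stage) are inferred.
\begin{align*}
2211 &= 1000 + (D - 1) + 200\,(S - 1)\\
1211 &= (D - 1) + 200\,(S - 1) = 11 + 200 \times 6
\quad\Rightarrow\quad D = 12,\ S = 7.
\end{align*}

Since D − 1 < 200 is the only way to split 1211 this way, the consistent reading is a 12-stage pipeline with branches resolved at the end of the 7th stage.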
(b) In a new, higher-performance version of the processor, the architects implement a
mysterious branch prediction mechanism to improve the performance of the processor.
They keep the rest of the design exactly the same as before. The new design with the
mysterious branch predictor completes the execution of the following code in 115 cycles.
MOV R1, #0 // R1 = 0
LOOP_1:
BEQ R1, #5, LAST // Branch to LAST if R1 == 5
ADD R1, R1, #1 // R1 = R1 + 1
MOV R2, #0 // R2 = 0
LOOP_2:
BEQ R2, #3, LOOP_1 // Branch to LOOP_1 if R2==3.
ADD R2, R2, #1 // R2 = R2 + 1
B LOOP_2 // Unconditional branch to LOOP_2
LAST:
MOV R1, #1 // R1 = 1
Assume that the pipeline never stalls due to a data dependency. Based on the given
information, determine which of the following branch prediction mechanisms could be
the mysterious branch predictor implemented in the new version of the processor. For
each branch prediction mechanism below, you should circle the configuration parameters
that make it match the performance of the mysterious branch predictor.
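Before the per-predictor answers, it helps to pin down how many mispredictions the mysterious predictor makes. A sketch, reusing the 12-stage pipeline and 6-cycle branch penalty inferred in (a), and assuming (consistent with (a)) that only mispredicted conditional branches stall the pipeline:

% Dynamic instruction count: the initial MOV; five outer iterations of
% {BEQ, ADD, MOV, 3 x {BEQ, ADD, B}, taken BEQ}; the final taken BEQ; and
% the MOV at LAST: 1 + 5*(3 + 9 + 1) + 1 + 1 = 68 instructions.
\begin{align*}
115 = 68 + 11 + 6\,m \quad\Rightarrow\quad m = 6,
\end{align*}

so the mysterious predictor mispredicts exactly 6 branches; this is the count each YES/NO answer below is checked against.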
i) Static Branch Predictor
Could this be the mysterious branch predictor?
Explain:
YES, if the static prediction direction is always not-taken.
ii) Last-Time Branch Predictor
Could this be the mysterious branch predictor?
YES NO
If YES, for which configuration is the answer YES? Pick an option for each configuration
parameter.
ii. Local for each branch instruction (PC-based) or global (shared among all branches)
history?
NO.
Explanation: There is no configuration of this branch predictor that results in exactly
6 mispredictions for the above program.
iii) BTFN (Backward Taken, Forward Not-taken) Predictor
Explain:
NO.
Explanation: The BTFN predictor does not make exactly 6 mispredictions for the
above program.
iv) Two-Bit Counter Branch Predictor
Could this be the mysterious branch predictor?
YES NO
If YES, for which configuration is the answer YES? Pick an option for each
configuration parameter.
ii. Local for each branch instruction (i.e., PC-based, without any interference between
different branches) or global (i.e., a single counter shared among all branches) history?
Local Global
Explain:
YES, if local two-bit counters with 00 or 01 initial values are used.
3 SIMD
We have two SIMD engines: 1) a traditional vector processor and 2) a traditional array
processor. Both processors can support a vector length up to 16.
All instructions can be fully pipelined, the processor can issue one vector instruction per
cycle, and the pipeline does not forward data (no chaining). For the sake of simplicity,
we ignore the latency of the pipeline stages other than the execution stages (e.g., decode
stage latency: 0 cycles, write-back latency: 0 cycles, etc.).
We implement the following instructions in both designs, with their corresponding
execution latencies. [The latency table is not reproduced in this copy; the timing
sketches below refer to its entries symbolically.]
• Both processors have eight vector registers (VR0 to VR7) which can contain up to 16
elements, and eight scalar registers (R0 to R7). The entire vector register needs to be ready
(i.e., populated with all VLEN elements) before any element of it can be used as part of
another operation.
• The memory can sustain a throughput of one element per cycle. The memory consists of
16 banks that can be accessed independently. A single memory access can be initiated in
each cycle. The memory can sustain 16 parallel accesses if they all go to different banks.
(a) Which processor (array or vector processor) is more costly in terms of chip area?
Explain.
Array processor. An array processor needs one functional unit per vector element
(16 parallel lanes here), whereas a vector processor pipelines all elements through a
much smaller set of functional units, so the array processor is more costly in chip area.
(b) The following code takes 52 cycles to execute on the vector processor:
VADD VR2 ← VR1, VR0
VADD VR3 ← VR2, VR5
VMUL VR6 ← VR2, VR3
What is the VLEN of the instructions? Explain your answer.
VLEN: 10
How long would the same code take to execute on an array processor with the same
vector length?
25 cycles
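A sketch of the timing, writing L_VADD and L_VMUL for the execution latencies from the omitted table. With no chaining, each dependent instruction starts only after the previous one completes; a pipelined vector instruction of latency L finishes L + VLEN − 1 cycles after it starts, while the array processor runs all lanes in parallel and pays only L per instruction:

% Three back-to-back dependent instructions (VADD -> VADD -> VMUL).
\begin{align*}
T_{vector} &= 3\,(\mathrm{VLEN}-1) + (2\,L_{VADD} + L_{VMUL}) = 52\\
T_{array}  &= 2\,L_{VADD} + L_{VMUL} = 25\\
\Rightarrow\ 3\,(\mathrm{VLEN}-1) &= 52 - 25 = 27 \quad\Rightarrow\quad \mathrm{VLEN} = 10.
\end{align*}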
(c) The following code takes 94 cycles to execute on the vector processor:
VLD VR0 ← mem[ R0 ]
VLD VR1 ← mem[ R1 ]
VADD VR2 ← VR1, VR0
VSHR VR2 ← VR2
VST VR2 → mem[ R2 ]
Assume that the elements loaded in VR0 are all placed in different banks, and that the
elements loaded into VR1 are placed in the same banks as the elements in VR0.
Similarly, the elements of VR2 are stored in different banks in memory. What is the
VLEN of the instructions? Explain your answer.
VLEN: 8
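Again a sketch, with L_MEM the (omitted) VLD/VST latency and L_VADD, L_VSHR the ALU latencies. The second VLD targets exactly the banks used by the first, so each of its elements waits one full bank service time behind the first load's access to the same bank; the VST sees no conflicts because VR2 goes to different banks:

\begin{align*}
94 &= (2\,L_{MEM} + \mathrm{VLEN} - 1) + (L_{VADD} + \mathrm{VLEN} - 1)\\
   &\quad + (L_{VSHR} + \mathrm{VLEN} - 1) + (L_{MEM} + \mathrm{VLEN} - 1)\\
   &= 3\,L_{MEM} + L_{VADD} + L_{VSHR} + 4\,(\mathrm{VLEN} - 1),
\end{align*}

which the latencies in the omitted table satisfy for VLEN = 8.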
(d) We replace the memory with a new module whose characteristics are unknown. The
following code (the same as that in (c)) takes 163 cycles to execute on the vector
processor:
VLD VR0 ← mem[ R0 ]
VLD VR1 ← mem[ R1 ]
VADD VR2 ← VR1, VR0
VSHR VR2 ← VR2
VST VR2 → mem[ R2 ]
The VLEN of the instructions is 16. The elements loaded in VR0 are placed in
consecutive banks, the elements loaded in VR1 are placed in consecutive banks, and the
elements of VR2 are also stored in consecutive banks. What is the number of banks of the
new memory module? Explain.
[Correction] The number of cycles should be 170 instead of 163. When grading this
question, the instructor took into account only the student’s reasoning.
Number of banks: 8
4 In-DRAM Bitmap Indices
One real-world application that can benefit from Ambit’s in-DRAM bulk bitwise
operations is the database bitmap index, as we also discussed in the lecture. By using
bitmap indices, we want to run the following query on a database that keeps track of user
actions: “How many unique users were active every week for the past w weeks?” Every
week, each user is represented by a single bit. If the user was active in a given week, the
corresponding bit is set to 1. The total number of users is u.
We assume the bits corresponding to one week are all in the same row. If u is greater
than the total number of bits in one row (the row size is 8 kilobytes), more rows in
different subarrays are used for the same week. We assume that all weeks corresponding
to the users in one subarray fit in that subarray. We would like to compare two possible
implementations of the database query:
• CPU-based implementation: This implementation reads the bits of all u users for the w
weeks. For each user, it ANDs the bits corresponding to the past w weeks. Then, it
performs a bit-count operation to compute the final result.
Since this operation is very memory-bound, we simplify the estimation of the execution
time as the time needed to read all bits for the u users in the last w weeks. The memory
bandwidth that the CPU can exploit is X bytes/s.
• Ambit-based implementation: This implementation takes advantage of the bulk AND operations
of Ambit. In each subarray, we reserve one Accumulation row and one Operand row
(besides the control rows that are needed for the regular operation of Ambit). Initially, all
bits in the Accumulation row are set to 1. Any row can be moved to the Operand row by
using RowClone (recall that RowClone is a mechanism that enables very fast copying of a
row to another row in the same subarray). t_rc and t_and are the latencies (in seconds) of
RowClone’s copy and Ambit’s AND, respectively. Since Ambit does not support bit-count
operations inside DRAM, the final bit-count is still executed on the CPU. We consider that
the execution time of the bit-count operation is negligible compared to the time needed to
read all bits from the Accumulation rows by the CPU.
(a) What is the total number of DRAM rows that are occupied by u users and w weeks?
TotalRows = ⌈u / (8 × 8K)⌉ × w.
Explanation:
Each 8-kilobyte row holds 8 × 8K bits, i.e., one bit for each of 8 × 8K users, so the
u users are spread across a number of subarrays:
NumSubarrays = ⌈u / (8 × 8K)⌉.
Each subarray uses one row per week, which gives NumSubarrays × w rows in total.
(b) What is the throughput in users/second of the Ambit-based implementation?
Explanation:
First, let us calculate the total time of all bulk AND operations. We should add
t_rc and t_and for all rows:
t_and-total = ⌈u / (8 × 8K)⌉ × w × (t_rc + t_and) seconds.
Then, we calculate the time needed for the CPU to read the Accumulation rows
(u/8 bytes in total) and compute the bit count:
t_bitcount = (u/8) / X = u / (8X) seconds.
The throughput is therefore u / (t_and-total + t_bitcount) users/second.
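A small Python sketch of this model; the parameter values in the example call are purely hypothetical, only the formulas come from the solution above:

from math import ceil

ROW_BYTES = 8 * 1024        # row size: 8 kilobytes
ROW_BITS = 8 * ROW_BYTES    # bits (i.e., users) per row

def ambit_throughput(u, w, t_rc, t_and, X):
    """Users/second of the Ambit-based implementation."""
    rows_per_week = ceil(u / ROW_BITS)                 # = ⌈u / (8 × 8K)⌉
    t_and_total = rows_per_week * w * (t_rc + t_and)   # RowClone + AND per row
    t_bitcount = (u / 8) / X                           # CPU reads u/8 bytes
    return u / (t_and_total + t_bitcount)

# Hypothetical example: 2**20 users, 4 weeks, 1 us RowClone, 1 us AND, 10 GB/s.
print(ambit_throughput(2**20, 4, 1e-6, 1e-6, 10e9))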
(c) What is the throughput in users/second of the CPU implementation?
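A sketch that follows directly from the stated CPU model (read all u × w bits at X bytes/s):

\begin{align*}
t_{CPU} = \frac{u\,w/8}{X}\ \text{seconds}
\quad\Rightarrow\quad
\mathrm{Throughput}_{CPU} = \frac{u}{t_{CPU}} = \frac{8X}{w}\ \text{users/second}.
\end{align*}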
(d) What is the maximum w for the CPU implementation to be faster than the Ambit-
based implementation? Assume u is a multiple of the row size.
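A sketch of the comparison, assuming u is an exact multiple of the row size (so the ceiling in the row count drops) and writing 8K for the 8192-byte row size; the CPU implementation is faster while:

\begin{align*}
\frac{u\,w}{8X} &< \frac{u}{8 \times 8\mathrm{K}}\;w\,(t_{rc} + t_{and}) + \frac{u}{8X}\\
\frac{w-1}{8X} &< \frac{w\,(t_{rc} + t_{and})}{8 \times 8\mathrm{K}}\\
w\,\Bigl(1 - \frac{X\,(t_{rc} + t_{and})}{8\mathrm{K}}\Bigr) &< 1
\quad\Rightarrow\quad
w < \frac{1}{1 - X\,(t_{rc} + t_{and})/8\mathrm{K}},
\end{align*}

valid when X(t_rc + t_and) < 8K; otherwise the inequality holds for every w and the CPU implementation is always faster.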
5 BONUS: Caching vs. Processing-in-Memory
We are given the following piece of code that makes accesses to integer arrays A and
B. The size of each element in both A and B is 4 bytes. The base address of array A is
0x00001000, and the base address of B is 0x00008000.
Outer_Loop:
movi R4, #0
movi R7, #0
Inner_Loop:
add R5, R3, R4 // R5 = R3 + R4
// load 4 bytes from memory address R1+R5
ld R5, [R1, R5] // R5 = Memory[R1 + R5]
ld R6, [R2, R4] // R6 = Memory[R2 + R4]
mul R5, R5, R6 // R5 = R5 * R6
add R7, R7, R5 // R7 += R5
inc R4 // R4++
bne R4, #2, Inner_Loop // If R4 != 2, jump to Inner_Loop
You are running the above code on a single-core processor. For now, assume that the
processor does not have caches. Therefore, all load/store instructions access the main
memory, which has a fixed 50-cycle latency for both read and write operations. Assume
that all load/store operations are serialized, i.e., the latency of multiple memory requests
cannot be overlapped. Also assume that the execution time of a non-memory-access
instruction is zero (i.e., we ignore its execution time).
(a) What is the execution time of the above piece of code in cycles?
4000 cycles.
Explanation: There are 5 memory accesses for each outer loop iteration. The outer
loop iterates 16 times, and each memory access takes 50 cycles. 16 ∗ 5 ∗ 50 = 4000
cycles
(b) Assume that a 128-byte private cache is added to the processor core in the next-
generation processor. The cache block size is 8 bytes. The cache is direct-mapped. On a
hit, the cache services both read and write requests in 5 cycles. On a miss, the main
memory is accessed and the access fills an 8-byte cache line in 50 cycles. Assuming that
the cache is initially empty, what is the new execution time on this processor with the
described cache? Show your work.
1120 cycles.
Explanation: In the first iterations, the blocks of A and B conflict in set 0 of the
cache; the later blocks of A map to other sets and stop conflicting. Here is the access
pattern for the first outer-loop iteration:
Iteration 0: A[0], B[0], A[1], B[1], A[0]
The first 4 references are loads; the last (A[0]) is a store. The cache is initially empty,
so A[0] misses and the block holding A[0] and A[1] is fetched into set 0. Then B[0]
misses, and it conflicts with A[0], so A[0] and A[1] are evicted. Similarly, all cache
blocks referenced in the first iteration conflict with each other in set 0. Since every
reference misses, the latency of those 5 references is 5 ∗ 50 = 250 cycles.
The status of the cache after those five references is:
- A(0,1) is in set 0
- the rest of the cache is empty
Second outer-loop iteration, cache hits/misses in the order of the references:
H, M, M, H, M
Latency = 2 ∗ 5 + 3 ∗ 50 = 160 cycles
Cache Status:
- A(0,1) is in set 0
- A(2,3) is in set 1
-the rest of the cache is empty
Third outer-loop iteration, cache hits/misses:
H, M, H, H, H
Latency: 4 ∗ 5 + 1 ∗ 50 = 70 cycles
Cache Status:
-B(0,1) is in set 0
-A(2,3) is in set 1
-the rest of the cache is empty
Fourth outer-loop iteration, cache hits/misses:
H, H, M, H, H
Latency: 4 ∗ 5 + 1 ∗ 50 = 70 cycles
Cache Status:
-B(0,1) is in set 0
-A(2,3) is in set 1
-A(4,5) is in set 2
-the rest of the cache is empty
After this point, single-miss (70-cycle) and zero-miss (all-hit, 25-cycle) iterations
alternate for the remaining 12 iterations of the outer loop.
Overall Latency:
250 + 160 + 70 + 70 + (25 + 70) ∗ 6 = 1120 cycles
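The per-iteration accounting above can be checked mechanically. Below is a minimal Python model of the described cache (128 bytes, 8-byte blocks, direct-mapped, 5-cycle hits, 50-cycle misses); the example trace is the first-iteration pattern listed above, and any other trace would have to be read off the original code and figures:

BLOCK = 8                    # 8-byte cache blocks
SETS = 128 // BLOCK          # 128-byte direct-mapped cache -> 16 sets
HIT, MISS = 5, 50            # cycles (loads and stores time identically)

def run(trace):
    """trace: byte addresses in access order; returns total cycles."""
    tags = [None] * SETS
    cycles = 0
    for addr in trace:
        block = addr // BLOCK
        s, tag = block % SETS, block // SETS
        if tags[s] == tag:
            cycles += HIT
        else:                # miss: fetch the 8-byte line, evicting the old one
            cycles += MISS
            tags[s] = tag
    return cycles

A, B = 0x00001000, 0x00008000   # base addresses of the 4-byte-element arrays
# Iteration 0 as listed above: A[0], B[0], A[1], B[1], store A[0]
print(run([A, B, A + 4, B + 4, A]))  # -> 250 (all five references miss)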
(c) You are not satisfied with the performance after implementing the described cache.
To do better, you consider utilizing a processing unit that is available close to the main
memory. This processing unit can directly interface with the main memory at a 10-cycle
latency, for both read and write operations. How many cycles does it take to execute the
same program using the in-memory processing unit? (Assume that the in-memory
processing unit does not have a cache, and that the memory accesses are serialized as in
the processor core. The latency of the non-memory-access operations is ignored.)
800 cycles.
Explanation: Same as for the processor core without a cache, but the memory
access latency is 10 cycles instead of 50. 16 ∗ 5 ∗ 10 = 800
(d) Your friend now suggests that, by changing the cache capacity of the single-core
processor (in part (b)), she could provide performance as good as that of the system
that utilizes the memory processing unit (in part (c)).
Is she correct? What is the minimum capacity required for the cache of the single-core
processor to match the performance of the program running on the memory processing
unit?
No, she is not correct.
Explanation: Increasing the cache capacity does not help: a larger direct-mapped cache
still cannot eliminate the conflict misses in set 0 during the first iterations of the outer
loop, and even if every conflict were removed, the compulsory (cold) misses alone would
keep the cached execution slower than the 800-cycle in-memory execution.
(e) What other changes could be made to the cache design to improve the
performance of the single-core processor on this program?
Explanation: Although there is enough cache capacity to exploit the locality of the
accesses, the data accessed in the first iterations map to the same set, which causes
conflict misses. To improve the hit rate and the performance, we can change the
address-to-set mapping policy; for example, we can make the cache set-associative
or fully associative.