Online Architecture Assignment Help

For any assignment-related queries, call us at +1 678 648 4277,
visit https://www.architectureassignmenthelp.com/, or
email [email protected]
1 Memory Scheduling

1.1 Basics and Assumptions

To serve a memory request, the memory controller issues one or multiple DRAM
commands to access data from a bank. There are four different DRAM commands as
discussed in class.

• ACTIVATE: Loads the row (that needs to be accessed) into the bank’s row-buffer.
This is called opening a row. (Latency: 15ns)
• PRECHARGE: Prepares the bank for the next access by closing the row in the bank
(and making the row buffer empty). (Latency: 15ns)
• READ/WRITE: Accesses data from the row-buffer. (Latency: 15ns)

The diagrams below show snapshots of the memory controller’s request queues at time
0 (i.e., t0), when applications A, B, and C are executed together on a multi-core
processor. Each application runs on a separate core but shares the memory subsystem
with other applications. Each request is color-coded to denote the application to which
it belongs. Additionally, each request is annotated with a number that shows the order of
the request among the set of enqueued requests of the application to which it belongs.

For example, A3 means that this is the third request from application A enqueued in the
request queue. Assume all memory requests are reads and a read request is considered to
be served when the READ command is complete (i.e., 15 ns after the request’s READ
command is issued).

Assume also the following:

• The memory system has two DRAM channels, one DRAM bank per channel, and four
rows per bank.
• All the row-buffers are closed (i.e., empty) at time 0.
• All applications start to stall at time 0 because of memory.
• No additional requests from any of the applications arrive at the memory controller.
• An application (A, B, or C) is considered to be stalled until all of its memory requests
(across all the request buffers) have been served.

1.2 Problem Specification

The table below shows the stall time of applications A, B, and C with the FCFS
(First-Come, First-Served) and FR-FCFS (First-Ready, First-Come, First-Served)
scheduling policies.

Scheduling   Application A   Application B   Application C
FCFS         195 ns          285 ns          135 ns
FR-FCFS      135 ns          225 ns          90 ns

The diagrams below show the scheduling order of requests for Channel 0 and Channel
1 with the FCFS and FR-FCFS scheduling policies.

What are the numbers of row hits and row misses for each DRAM bank with either of
the scheduling policies? Show your work.

Channel 0, hits: FCFS: 3, FR-FCFS: 5

Channel 0, misses: FCFS: 5, FR-FCFS: 3

Channel 1, hits: FCFS: 2, FR-FCFS: 4

Channel 1, misses: FCFS: 6, FR-FCFS: 4

To calculate the number of hits and misses we should consider the following facts:

• The first request in each channel is always a row-buffer miss; it requires one
ACTIVATE and one READ command, which leads to a 30 ns delay.
• For every request in each channel except the first, a row-buffer miss requires
one PRECHARGE, one ACTIVATE, and one READ command, which leads to a 45 ns
delay.

• A row-buffer hit requires one READ command which leads to a 15 ns delay.
• The stall time of each application is equal to the maximum service time of its
requests in both Channel 0 and Channel 1.
• When using the FR-FCFS policy, the requests will be reordered to exploit row buffer
locality. For example, the B1 request in Channel 0 is reordered in FR-FCFS with
respect to FCFS and executed after C1, A3, B2, and B4 requests. This means that A2,
C1, A3, B2, and B4 are all accessing the same row.
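These latency rules can be folded into a small single-bank model. The sketch below
(Python; the function name and the example request stream are ours, standing in for
the request streams in the diagrams) applies the 30/45/15 ns rules to a sequence of
row addresses in scheduling order:

def bank_service_times(rows, t_cmd=15):
    # rows: the row accessed by each request, in scheduling order (one bank).
    times, open_row, t = [], None, 0
    for row in rows:
        if row == open_row:            # row hit: READ only
            t += t_cmd
        elif open_row is None:         # first access: ACTIVATE + READ
            t += 2 * t_cmd
        else:                          # row miss: PRECHARGE + ACTIVATE + READ
            t += 3 * t_cmd
        open_row = row
        times.append(t)                # served when the READ completes
    return times

print(bank_service_times([0, 0, 1]))   # [30, 45, 90] (ns)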

2 Branch Prediction

A processor implements an in-order pipeline with 12 stages. Each stage completes in a
single cycle. The pipeline stalls on a conditional branch instruction until the condition
of the branch is evaluated. However, you do not know at which stage the branch
condition is evaluated. Please answer the following questions.

(a) A program with 1000 dynamic instructions completes in 2211 cycles. If 200 of those
instructions are conditional branches, at the end of which pipeline stage are the branch
instructions resolved? (Assume that the pipeline does not stall for any reason other
than the conditional branches (e.g., data dependencies) during the execution of this
program.)

At the end of the 7th stage.

Explanation: Total cycles = 12 + (1000 − 1) + 200 ∗ X

2211 = 1011 + 200 ∗ X
1200 = 200 ∗ X
X = 6

Each branch causes 6 idle cycles (bubbles); thus, branches are resolved at the end of the
7th stage.
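The relationship used above generalizes to a one-line model (a sketch; the function
name is ours):

def total_cycles(n_instr, n_branches, resolve_stage, stages=12):
    # The first instruction fills the pipeline; each conditional branch then
    # injects (resolve_stage - 1) bubbles before fetch resumes.
    return stages + (n_instr - 1) + n_branches * (resolve_stage - 1)

assert total_cycles(1000, 200, resolve_stage=7) == 2211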

(b) In a new, higher-performance version of the processor, the architects implement a
mysterious branch prediction mechanism to improve the performance of the processor.
They keep the rest of the design exactly the same as before. The new design with the
mysterious branch predictor completes the execution of the following code in 115 cycles.

    MOV R1, #0          // R1 = 0
LOOP_1:
    BEQ R1, #5, LAST    // Branch to LAST if R1 == 5
    ADD R1, R1, #1      // R1 = R1 + 1
    MOV R2, #0          // R2 = 0
LOOP_2:
    BEQ R2, #3, LOOP_1  // Branch to LOOP_1 if R2 == 3
    ADD R2, R2, #1      // R2 = R2 + 1
    B LOOP_2            // Unconditional branch to LOOP_2
LAST:
    MOV R1, #1          // R1 = 1

Assume that the pipeline never stalls due to a data dependency. Based on the given
information, determine which of the following branch prediction mechanisms could be
the mysterious branch predictor implemented in the new version of the processor. For
each branch prediction mechanism below, you should circle the configuration parameters
that make it match the performance of the mysterious branch predictor.

i) Static Branch Predictor


Could this be the mysterious branch predictor?
YES NO
If YES, for which configuration below is the answer YES? Pick an option for each
configuration parameter.

i. Static Prediction Direction


Always taken Always not taken

Explain:
YES, if the static prediction direction is always not taken.

Explanation: Such a predictor makes 6 mispredictions, which is the number that
results in a 115-cycle execution time for the above program.
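The misprediction count can be checked by replaying the dynamic outcomes of the two
conditional branches. A sketch (Python; the trace construction follows the loop bounds
in the code above, and the 68 dynamic instructions and 6-cycle penalty come from
part (a)):

def branch_trace():
    # (branch, taken) for each dynamic conditional branch, in program order.
    trace = []
    for _ in range(5):                   # outer-loop body executes 5 times
        trace.append(("BEQ_R1", False))  # R1 != 5: fall through
        trace += [("BEQ_R2", t) for t in (False, False, False, True)]
    trace.append(("BEQ_R1", True))       # R1 == 5: branch to LAST
    return trace

# Always-not-taken mispredicts exactly the taken branches.
mispredictions = sum(taken for _, taken in branch_trace())
assert mispredictions == 6
assert 12 + (68 - 1) + 6 * mispredictions == 115   # 68 dynamic instructions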

ii) Last-Time Branch Predictor
Could this be the mysterious branch predictor?

YES NO

If YES, for which configuration is the answer YES? Pick an option for each configuration
parameter.

i. Initial Prediction Direction


Taken Not taken

ii. Local for each branch instruction (PC-based) or global (shared among all branches)
history?

NO.

Explanation: There is no configuration of this branch predictor that results in 6
mispredictions for the above program.
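This can be verified exhaustively by reusing branch_trace() from above (a sketch; the
configuration flags are ours):

def last_time_mispredictions(trace, init_taken, local_hist):
    hist, miss = {}, 0
    for pc, taken in trace:
        key = pc if local_hist else "shared"   # PC-based vs. one shared bit
        pred = hist.get(key, init_taken)
        miss += (pred != taken)
        hist[key] = taken                      # remember the last outcome
    return miss

# All four configurations; none of them produces the 6 mispredictions
# required for a 115-cycle execution time.
for init_taken in (False, True):
    for local_hist in (False, True):
        print(init_taken, local_hist,
              last_time_mispredictions(branch_trace(), init_taken, local_hist))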

iii) Backward taken, Forward not taken (BTFN)

Could this be the mysterious branch predictor?


YES NO

Explain:

NO.

Explanation: The BTFN predictor does not make exactly 6 mispredictions for the
above program.
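Reusing branch_trace() once more (a sketch; in our trace, BEQ_R2 targets LOOP_1, which
lies earlier in the code, so it is the backward branch, while BEQ_R1 branches forward
to LAST):

# BTFN: backward branches predicted taken, forward branches not taken.
btfn_misses = sum((pc == "BEQ_R2") != taken for pc, taken in branch_trace())
assert btfn_misses == 16   # far from the 6 mispredictions needed for 115 cycles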

iv) Two-bit Counter Based Prediction (using saturating arithmetic)


Could this be the mysterious branch predictor?

YES NO

If YES, for which configuration is the answer YES? Pick an option for each
configuration parameter.

i. Initial Prediction Direction


00 (Strongly not taken) / 01 (Weakly not taken) / 10 (Weakly taken) / 11 (Strongly taken)

ii. Local for each branch instruction (i.e., PC-based, without any interference between
different branches) or global (i.e., a single counter shared among all branches) history?

Local Global

Explain:
YES, if local two-bit counters initialized to 00 or 01 are used.

Explanation: Such a configuration yields 6 mispredictions, which results in a 115-cycle
execution time for the above program.
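A quick check with one saturating two-bit counter per branch PC (a sketch reusing
branch_trace(); the encodings 00–11 map to counter values 0–3, with the upper bit
acting as the prediction):

def two_bit_mispredictions(trace, init):
    ctr, miss = {}, 0
    for pc, taken in trace:
        c = ctr.get(pc, init)
        miss += ((c >= 2) != taken)     # predict taken iff the MSB is set
        ctr[pc] = min(c + 1, 3) if taken else max(c - 1, 0)
    return miss

assert two_bit_mispredictions(branch_trace(), init=0b00) == 6
assert two_bit_mispredictions(branch_trace(), init=0b01) == 6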

3 SIMD
We have two SIMD engines: 1) a traditional vector processor and 2) a traditional array
processor. Both processors can support a vector length up to 16.

All instructions can be fully pipelined, the processor can issue one vector instruction per
cycle, and the pipeline does not forward data (no chaining). For the sake of simplicity,
we ignore the latency of the pipeline stages other than the execution stages (e.g., decode
stage latency: 0 cycles, write-back latency: 0 cycles, etc.).
We implement the following instructions in both designs, with their corresponding
execution latencies:

Name   Operation              Description     Latency of a single operation (VLEN=1)
VADD   VDST ← VSRC1 + VSRC2   vector add      5 cycles
VMUL   VDST ← VSRC1 * VSRC2   vector mult.    15 cycles
VSHR   VDST ← VSRC >> 1       vector shift    1 cycle
VLD    VDST ← mem[SRC]        vector load     20 cycles
VST    VSRC → mem[DST]        vector store    20 cycles
• All the vector instructions operate with a vector length specified by VLEN. The VLD
instruction loads VLEN consecutive elements from memory into the VDST register,
starting from the address specified in SRC. The VST instruction stores the VLEN elements
of the VSRC register to consecutive addresses in memory, starting from the address
specified in DST.

• Both processors have eight vector registers (VR0 to VR7) which can contain up to 16
elements, and eight scalar registers (R0 to R7). The entire vector register needs to be ready
(i.e., populated with all VLEN elements) before any element of it can be used as part of
another operation.
• The memory can sustain a throughput of one element per cycle. The memory consists of
16 banks that can be accessed independently. A single memory access can be initiated in
each cycle. The memory can sustain 16 parallel accesses if they all go to different banks.

(a) Which processor (array or vector processor) is more costly in terms of chip area?
Explain.
Array processor.

Explanation: An array processor requires 16 functional units for an operation, whereas a
vector processor requires only 1.

(b) The following code takes 52 cycles to execute on the vector processor:
VADD VR2 ← VR1, VR0
VADD VR3 ← VR2, VR5
VMUL VR6 ← VR2, VR3

What is the VLEN of the instructions? Explain your answer.

VLEN: 10

Explanation: 5+(VLEN-1)+5+(VLEN-1)+15+(VLEN-1) = 52 ⇒ VLEN = 10

How long would the same code take to execute on an array processor with the same
vector length?

25 cycles

Explanation: There are data dependencies among the instructions ⇒ 5 + 5 + 15 = 25 cycles.
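Both results follow from one timing rule: without chaining, each instruction in a
dependent chain starts only after its producer completes; a VLEN-wide operation occupies
the vector processor's pipeline for latency + (VLEN − 1) cycles, while the array
processor finishes all lanes in the bare latency. A sketch (function names are ours):

def vector_time(latencies, vlen):
    # Dependent chain on the vector processor: L + (VLEN - 1) per instruction.
    return sum(l + (vlen - 1) for l in latencies)

def array_time(latencies):
    # Dependent chain on the array processor: all lanes finish together.
    return sum(latencies)

assert vector_time([5, 5, 15], vlen=10) == 52   # part (b), vector processor
assert array_time([5, 5, 15]) == 25             # part (b), array processor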

(c) The following code takes 94 cycles to execute on the vector processor:
VLD VR0 ← mem[ R0 ]
VLD VR1 ← mem[ R1 ]
VADD VR2 ← VR1, VR0
VSHR VR2 ← VR2
VST VR2 → mem[ R2 ]
Assume that the elements loaded in VR0 are all placed in different banks, and that the
elements loaded into VR1 are placed in the same banks as the elements in VR0.
Similarly, the elements of VR2 are stored in different banks in memory. What is the
VLEN of the instructions? Explain your answer.

VLEN: 8

Explanation: 20 + 20 + (VLEN−1) + 5 + (VLEN−1) + 1 + (VLEN−1) + 20 + (VLEN−1) = 94
⇒ VLEN = 8
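The same equation can be solved mechanically (a sketch encoding the formula above):

def part_c_cycles(vlen):
    return 20 + 20 + (vlen - 1) + 5 + (vlen - 1) + 1 + (vlen - 1) + 20 + (vlen - 1)

assert next(v for v in range(1, 17) if part_c_cycles(v) == 94) == 8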

(d) We replace the memory with a new module whose characteristics are unknown. The
following code (the same as that in (c)) takes 163 cycles to execute on the vector
processor:
VLD VR0 ← mem[ R0 ]
VLD VR1 ← mem[ R1 ]
VADD VR2 ← VR1, VR0
VSHR VR2 ← VR2
VST VR2 → mem[ R2 ]

The VLEN of the instructions is 16. The elements loaded in VR0 are placed in
consecutive banks, the elements loaded in VR1 are placed in consecutive banks, and the
elements of VR2 are also stored in consecutive banks. What is the number of banks of the
new memory module? Explain.

[Correction] The number of cycles should be 170 instead of 163. For grading this
question the instructor took into account only the student’s reasoning.

Number of banks: 8

Explanation: Assuming that the number of banks is a power of two:
20 × (16/banks) + 20 × (16/banks) + (banks − 1) + 5 + (VLEN − 1) + 1 + (VLEN − 1) +
20 × (16/banks) + (banks − 1) = 170 ⇒ banks = 8
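A search over power-of-two bank counts confirms the result (a sketch encoding the
formula above):

def part_d_cycles(banks, vlen=16):
    vld = 20 * (vlen // banks)   # each load streams 16/banks batches
    return (vld + vld + (banks - 1) + 5 + (vlen - 1) + 1 + (vlen - 1)
            + vld + (banks - 1))

assert [b for b in (1, 2, 4, 8, 16) if part_d_cycles(b) == 170] == [8]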

4 In-DRAM Bitmap Indices


Recall that in class we discussed Ambit, which is a DRAM design that can greatly
accelerate Bulk Bitwise Operations by providing the ability to perform bitwise AND/OR
of two rows in a subarray.

One real-world application that can benefit from Ambit’s in-DRAM bulk bitwise
operations is the database bitmap index, as we also discussed in the lecture. By using
bitmap indices, we want to run the following query on a database that keeps track of user
actions: "How many unique users were active every week for the past w weeks?" Every
week, each user is represented by a single bit. If the user was active in a given week, the
corresponding bit is set to 1. The total number of users is u.

We assume the bits corresponding to one week are all in the same row. If u is greater
than the total number of bits in one row (the row size is 8 kilobytes), more rows in
different subarrays are used for the same week. We assume that all weeks corresponding
to the users in one subarray fit in that subarray. We would like to compare two possible
implementations of the database query:

• CPU-based implementation: This implementation reads the bits of all u users for the w
weeks. For each user, it ANDs the bits corresponding to the past w weeks. Then, it
performs a bit-count operation to compute the final result.

Since this operation is very memory-bound, we simplify the estimation of the execution
time as the time needed to read all bits for the u users in the last w weeks. The memory
bandwidth that the CPU can exploit is X bytes/s.

• Ambit-based implementation: This implementation takes advantage of Ambit's bulk AND
operations. In each subarray, we reserve one Accumulation row and one Operand row
(besides the control rows that are needed for the regular operation of Ambit). Initially, all
bits in the Accumulation row are set to 1. Any row can be moved to the Operand row by
using RowClone (recall that RowClone is a mechanism that enables very fast copying of a
row to another row in the same subarray). trc and tand are the latencies (in seconds) of
RowClone's copy and Ambit's AND, respectively. Since Ambit does not support bit-count
operations inside DRAM, the final bit count is still executed on the CPU. We consider the
execution time of the bit-count operation negligible compared to the time needed to read
all bits from the Accumulation rows by the CPU.

(a) What is the total number of DRAM rows that are occupied by u users and w weeks?

TotalRows = ⌈u / (8 × 8K)⌉ × w

Explanation:
Each row stores 8 × 8K bits (the row size is 8 KB), so the u users are spread across a
number of subarrays:
NumSubarrays = ⌈u / (8 × 8K)⌉.

Thus, the total number of rows is:

TotalRows = ⌈u / (8 × 8K)⌉ × w.

(b) What is the throughput in users/second of the Ambit-based implementation?

ThrAmbit = u / (⌈u / (8 × 8K)⌉ × w × (trc + tand) + u / (8 × X)) users/second

Explanation:

First, let us calculate the total time for all bulk AND operations. We should add
trc and tand for all rows:
tand-total = ⌈u / (8 × 8K)⌉ × w × (trc + tand) seconds.

Then, we calculate the time needed to compute the bit count on the CPU:
tbitcount = (u / 8) / X = u / (8 × X) seconds.

Thus, the throughput in users/s is:

ThrAmbit = u / (tand-total + tbitcount) users/second.

(c) What is the throughput in users/second of the CPU implementation?

ThrCPU = (8 × X) / w users/second

Explanation:
We calculate the time needed to bring all users and weeks to the CPU:
tCPU = (u × w / 8) / X = (u × w) / (8 × X) seconds.
Thus, the throughput in users/s is:
ThrCPU = u / tCPU = (8 × X) / w users/second.

(d) What is the maximum w for the CPU implementation to be faster than the Ambit-
based implementation? Assume u is a multiple of the row size.
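A sketch of one way to bound w, assuming u is a multiple of the row size (so the
ceilings vanish) and writing the row size in bytes as 8K: the CPU implementation is
faster when tCPU < tAmbit, i.e.,

\[
\frac{u\,w}{8X} \;<\; \frac{u}{8 \times 8\mathrm{K}}\, w\, (t_{rc} + t_{and}) \;+\; \frac{u}{8X}.
\]

Multiplying both sides by \(8X/u\) and rearranging:

\[
w \left( 1 - \frac{X\,(t_{rc} + t_{and})}{8\mathrm{K}} \right) < 1
\quad\Longrightarrow\quad
w \;<\; \frac{1}{1 - \frac{X\,(t_{rc} + t_{and})}{8\mathrm{K}}},
\]

which is valid when \(X\,(t_{rc} + t_{and}) < 8\mathrm{K}\); if
\(X\,(t_{rc} + t_{and}) \geq 8\mathrm{K}\), the CPU implementation is faster for every w.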

5 BONUS: Caching vs. Processing-in-Memory

We are given the following piece of code that makes accesses to integer arrays A and
B. The size of each element in both A and B is 4 bytes. The base address of array A is
0x00001000, and the base address of B is 0x00008000.

movi R1, #0x1000 // Store the base address of A in R1
movi R2, #0x8000 // Store the base address of B in R2
movi R3, #0

Outer_Loop:
movi R4, #0
movi R7, #0
Inner_Loop:
add R5, R3, R4 // R5 = R3 + R4
// load 4 bytes from memory address R1+R5
ld R5, [R1, R5] // R5 = Memory[R1 + R5],
ld R6, [R2, R4] // R6 = Memory[R2 + R4]
mul R5, R5, R6 // R5 = R5 * R6
add R7, R7, R5 // R7 += R5
inc R4 // R4++
bne R4, #2, Inner_Loop // If R4 != 2, jump to Inner_Loop

// store the data of R7 in memory address R1+R3
st [R1, R3], R7 // Memory[R1 + R3] = R7
inc R3 // R3++
bne R3, #16, Outer_Loop // If R3 != 16, jump to Outer_Loop

You are running the above code on a single-core processor. For now, assume that the
processor does not have caches. Therefore, all load/store instructions access the main
memory, which has a fixed 50-cycle latency for both read and write operations. Assume
that all load/store operations are serialized, i.e., the latencies of multiple memory
requests cannot be overlapped. Also assume that the execution time of a
non-memory-access instruction is zero (i.e., we ignore its execution time).

(a) What is the execution time of the above piece of code in cycles?

4000 cycles.
Explanation: There are 5 memory accesses for each outer loop iteration. The outer
loop iterates 16 times, and each memory access takes 50 cycles. 16 ∗ 5 ∗ 50 = 4000
cycles

(b) Assume that a 128-byte private cache is added to the processor core in the
next-generation processor. The cache block size is 8 bytes. The cache is direct-mapped.
On a hit, the cache services both read and write requests in 5 cycles. On a miss, the main
memory is accessed and the access fills an 8-byte cache line in 50 cycles. Assuming that
the cache is initially empty, what is the new execution time on this processor with the
described cache? Show your work.

1120 cycles.

Explanation: At the beginning, A and B conflict in the first two cache lines; afterwards,
the elements of A and B settle into different cache lines. Here is the access pattern for
the first outer-loop iteration:

0 − A[0], B[0], A[1], B[1], A[0]

The first 4 references are loads; the last (A[0]) is a store. The cache is initially empty.
We have a cache miss for A[0]: A[0] and A[1] are fetched into the 0th index of the
cache. Then, B[0] is a miss, and it conflicts with A[0], so A[0] and A[1] are evicted.
Similarly, all cache blocks in the first iteration conflict with each other. Since we have
only cache misses, the latency for these 5 references is 5 ∗ 50 = 250 cycles.
The status of the cache after these five references (index 0 is replaced on every
reference):

Cache Index | Cache Block
0           | A(0,1) → B(0,1) → A(0,1) → B(0,1) → A(0,1)
Second iteration on the outer loop:
1 − A[1], B[0], A[2], B[1], A[1]

Cache hits/misses in the order of the references:
H, M, M, H, M
Latency = 2 ∗ 5 + 3 ∗ 50 = 160 cycles
Cache Status:
- A(0,1) is in set 0
- A(2,3) is in set 1
- the rest of the cache is empty

2 − A[2], B[0], A[3], B[1], A[2]

Cache hits/misses:
H, M, H, H, H
Latency : 4 ∗ 5 + 1 ∗ 50 = 70 cycles

Cache Status:
-B(0,1) is in set 0
-A(2,3) is in set 1
-the rest of the cache is empty

3 − A[3], B[0], A[4], B[1], A[3]

Cache hits/misses:
H, H, M, H, H
Latency : 4 ∗ 5 + 1 ∗ 50 = 70 cycles

Cache Status:
-B(0,1) is in set 0
-A(2,3) is in set 1
-A(4,5) is in set 2
-the rest of the cache is empty

4 − A[4], B[0], A[5], B[1], A[4]

Cache hits/misses:
H, H, H, H, H
Latency : 5 ∗ 5 = 25 cycles

Cache Status:
- B(0,1) is in set 0
- A(2,3) is in set 1
- A(4,5) is in set 2
- the rest of the cache is empty

After this point, single-miss (70-cycle) and all-hit (25-cycle) iterations alternate until
the 16th iteration.

Overall Latency:
250 + 160 + 70 + 70 + (25 + 70) ∗ 6 = 1120 cycles
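The per-iteration numbers above can be reproduced with a small direct-mapped cache
model. A sketch (it follows the solution's element-granularity reading of the code, i.e.,
element i of A lives at 0x1000 + 4i, and assumes write-allocate with a 50-cycle fill on
store misses):

def simulate(num_lines=16, block=8, t_hit=5, t_miss=50):
    A, B = 0x1000, 0x8000
    cache = {}                            # set index -> cached block number
    cycles = 0
    def access(addr):
        nonlocal cycles
        blk = addr // block
        idx = blk % num_lines
        if cache.get(idx) == blk:
            cycles += t_hit
        else:
            cycles += t_miss
            cache[idx] = blk              # fill on both load and store misses
    for r3 in range(16):                  # outer loop
        for r4 in range(2):               # inner loop
            access(A + 4 * (r3 + r4))     # ld A[r3 + r4]
            access(B + 4 * r4)            # ld B[r4]
        access(A + 4 * r3)                # st A[r3]
    return cycles

assert simulate() == 1120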
(c) You are not satisfied with the performance after implementing the described cache.
To do better, you consider utilizing a processing unit that is available close to the main
memory. This processing unit can directly interface with the main memory with a
10-cycle latency, for both read and write operations. How many cycles does it take to
execute the same program using the in-memory processing unit? (Assume that the
in-memory processing unit does not have a cache, and the memory accesses are
serialized like in the processor core. The latency of the non-memory-access operations
is ignored.)

800 cycles.

Explanation: Same as for the processor core without a cache, but the memory
access latency is 10 cycles instead of 50. 16 ∗ 5 ∗ 10 = 800

(d) Your friend now suggests that, by changing the cache capacity of the single-core
processor (in part (b)), she could provide performance as good as that of the system
that utilizes the memory processing unit (in part (c)).
Is she correct? What is the minimum capacity required for the cache of the single-core
processor to match the performance of the program running on the memory processing
unit?

No, she is not correct.

Explanation: Increasing the cache capacity does not help, because doing so cannot
eliminate the conflicts in Set 0 during the first iterations of the outer loop. Moreover,
even a cache with no conflict misses at all still incurs the 10 compulsory misses,
for at least 10 ∗ 50 + 70 ∗ 5 = 850 cycles, which is above the 800 cycles of the
in-memory version.

(e) What other changes could be made to the cache design to improve the
performance of the single-core processor on this program?

Increasing the associativity of the cache.

Explanation: Although there is enough cache capacity to exploit the locality of the
accesses, the fact that in the first iterations the accessed data map to the same set
causes conflicts. To improve the hit rate and the performance, we can change the
address-to-set mapping policy. For example, we can change the cache design to be
set-associative or fully associative.
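Extending the sketch from part (b) with associativity (LRU replacement) illustrates the
point: with 2 ways, the A/B conflicts disappear and only the 10 compulsory misses
remain (parameters are ours):

from collections import OrderedDict

def simulate_assoc(ways=2, num_lines=16, block=8, t_hit=5, t_miss=50):
    A, B = 0x1000, 0x8000
    sets = [OrderedDict() for _ in range(num_lines // ways)]
    cycles = 0
    def access(addr):
        nonlocal cycles
        blk = addr // block
        s = sets[blk % len(sets)]
        if blk in s:
            cycles += t_hit
            s.move_to_end(blk)            # mark as most-recently used
        else:
            cycles += t_miss
            if len(s) == ways:
                s.popitem(last=False)     # evict the least-recently used block
            s[blk] = True
    for r3 in range(16):
        for r4 in range(2):
            access(A + 4 * (r3 + r4))
            access(B + 4 * r4)
        access(A + 4 * r3)
    return cycles

assert simulate_assoc(ways=2) == 850      # 10 misses + 70 hits; still above 800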

