0% found this document useful (0 votes)

14 views

Online Architecture Assignment Help

Uploaded by

Architecture Assignment Help

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

14 views

Online Architecture Assignment Help

Uploaded by

Architecture Assignment Help

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

You are on page 1/ 29

Online Architecture Assignment Help

For any Assignment related queries, call us at : - +1 678 648 4277

visit : - https://www.architectureassignmenthelp.com/, or
Email : - [email protected]
1 Memory Scheduling

1.1 Basics and Assumptions

To serve a memory request, the memory controller issues one or multiple DRAM
commands to access data from a bank. There are four different DRAM commands as
discussed in class.

• ACTIVATE: Loads the row (that needs to be accessed) into the bank’s row-buffer.
This is called opening a row. (Latency: 15ns)
• PRECHARGE: Prepares the bank for the next access by closing the row in the bank
(and making the row buffer empty). (Latency: 15ns)
• READ/WRITE: Accesses data from the row-buffer. (Latency: 15ns)

The diagrams below show the snapshots of memory controller’s request queues at time
0, i.e., t0, when applications A, B, and C are executed together on a multi-core
processor. Each application runs on a separate core but shares the memory subsystem
with other applications. Each request is color-coded to denote the application to which
it belongs. Additionally, each request is annotated with a number that shows the order of
the request among the set of enqueued requests of the application to which it belongs.

architectureassignmenthelp.com
For example, A3 means that this is the third request from application A enqueued in the
request queue. Assume all memory requests are reads and a read request is considered to
be served when the READ command is complete (i.e., 15 ns after the request’s READ
command is issued).

Assume also the following:

• The memory system has two DRAM channels, one DRAM bank per channel, and four
rows per bank.
• All the row-buffers are closed (i.e., empty) at time 0.
• All applications start to stall at time 0 because of memory.
• No additional requests from any of the applications arrive at the memory controller.
• An application (A, B, or C) is considered to be stalled until all of its memory requests
(across all the request buffers) have been served.

architectureassignmenthelp.com
4.2 Problem Specification

The below table shows the stall time of applications A, B, and C with the FCFS (First-
Come, FirstServed) and FR-FCFS (First-Ready, First-Come, First-Served) scheduling
policies.

Scheduling Application A Application B Application C

FCFS 195 ns 285 ns 135 ns

FR-FCFS 135 ns 225 ns 90 ns

The diagrams below show the scheduling order of requests for Channel 0 and Channel
1 with the FCFS and FR-FCFS scheduling policies.

architectureassignmenthelp.com
What are the numbers of row hits and row misses for each DRAM bank with either of
the scheduling policies? Show your work.

Channel 0, hits: FCFS: 3, FR-FCFS: 5

Channel 0, misses: FCFS: 5, FR-FCFS: 3

Channel 1, hits: FCFS: 2, FR-FCFS: 4

Channel 1, misses: FCFS: 6, FR-FCFS: 4

To calculate the number of hits and misses we should consider the following facts:

• The first request of each channel will be always a row-buffer miss and it requires
one ACTIVATE and one READ command which lead to a 30 ns delay.
• For all requests in each channel, except for the first one, a row-buffer miss requires
one PRECHARGE, one ACTIVATE, and one READ command which lead to a 45 ns
delay.

architectureassignmenthelp.com
• A row-buffer hit requires one READ command which leads to a 15 ns delay.
• The stall time of each application is equal to the maximum service time of its
requests in both Channel 0 and Channel 1.
• When using the FR-FCFS policy, the requests will be reordered to exploit row buffer
locality. For example, the B1 request in Channel 0 is reordered in FR-FCFS with
respect to FCFS and executed after C1, A3, B2, and B4 requests. This means that A2,
C1, A3, B2, and B4 are all accessing the same row.

architectureassignmenthelp.com
2 Branch Prediction

A processor implements an in-order pipeline with 12 stages. Each stage completes in a

single cycle. The pipeline stalls on a conditional branch instruction until the condition
of the branch is evaluated. However, you do not know at which stage the branch
condition is evaluated. Please answer the following questions.

(a) A program with 1000 dynamic instructions completes in 2211 cycles. If 200 of those
instructions are conditional branches, at the end of which pipeline stage the branch
instructions are resolved? (Assume that the pipeline does not stall for any other reason
than the conditional branches (e.g., data dependencies) during the execution of that
program.)

At the end of the 7th stage.

Explanation: Total cycles = 12 + 1000 + 200 ∗ X − 1

2211 = 1011 + 200 ∗ X
1400 = 200 ∗ X
X=6
Each branch causes 6 idle cycles (bubbles), thus branches are resolved at the end of 7th
stage.

architectureassignmenthelp.com
(b) In a new, higher-performance version of the processor, the architects implement a
mysterious branch prediction mechanism to improve the performance of the processor.
They keep the rest of the design exactly the same as before. The new design with the
mysterious branch predictor completes the execution of the following code in 115 cycles.

MOV R1, #0 / / R1 = 0
LOOP_1:
BEQ R1, #5, LAST // Branch to LAST if R1 == 5
ADD R1, R1, #1 // R1 = R1 + 1
MOV R2, #0 // R2 = 0
LOOP_2:
BEQ R2, #3, LOOP_1 // Branch to LOOP_1 if R2==3.
ADD R2, R2, #1 // R2 = R2 + 1
B LOOP_2 // Unconditional branch to LOOP_2
LAST:
MOV R1, #1 // R1 = 0

architectureassignmenthelp.com
Assume that the pipeline never stalls due to a data dependency. Based on the given
information, determine which of the following branch prediction mechanisms could be
the mysterious branch predictor implemented in the new version of the processor. For
each branch prediction mechanism below, you should circle the configuration parameters
that makes it match the performance of the mysterious branch predictor.

i) Static Branch Predictor

Could this be the mysterious branch predictor?
YES NO
If YES, for which configuration below is the answer YES? Pick an option for each
configuration parameter.

i. Static Prediction Direction

Always taken Always not taken

Explain:
YES, if the static prediction direction is always not taken.

Explanation: Such a predictor makes 6 mispredictions, which is the number

resulting in 115 cycles execution time for the above program.

architectureassignmenthelp.com
ii)] Last Time Branch Predictor
Could this be the mysterious branch predictor?

YES NO

If YES, for which configuration is the answer YES? Pick an option for each configuration
parameter.

i. Initial Prediction Direction

Taken Not taken

ii. Local for each branch instruction (PC-based) or global (shared among all branches)
history?

NO.

Explanation: There is not a configuration for this branch predictor that results in 6
mispredictions for the above program.

iii) Backward taken, Forward not taken (BTFN)

Could this be the mysterious branch predictor?

architectureassignmenthelp.com
YES NO

Explain:

NO.

Explanation: BTFN predictor does not make exactly 6 mispredictions for the
above program.

iv) Two-bit Counter Based Prediction (using saturating arithmetic)

Could this be the mysterious branch predictor?

YES NO

If YES, for which configuration is the answer YES? Pick an option for each
configuration parameter.

i. Initial Prediction Direction

00 (Strongly not taken)/10 (Weakly taken) 01 (Weakly not taken)/11 (Strongly taken)

architectureassignmenthelp.com
ii. Local for each branch instruction (i.e., PC-based, without any interference between
different branches) or global (i.e., a single counter shared among all branches) history?

Local Global

Explain:
YES, if local history registers with 00 or 01 initial values are used.

Explanation: Such a configuration yields 6 mispredictions, which results in 115

cycles execution time for the above program.

3 SIMD
We have two SIMD engines: 1) a traditional vector processor and 2) a traditional array
processor. Both processors can support a vector length up to 16.

All instructions can be fully pipelined, the processor can issue one vector instruction per
cycle, and the pipeline does not forward data (no chaining). For the sake of simplicity,
we ignore the latency of the pipeline stages other than the execution stages (e.g, decode
stage latency: 0 cycles, write back latency: 0 cycles, etc).
architectureassignmenthelp.com
We implement the following instructions in both designs, with their corresponding
execution latencies:

Operation Description Name Latency of a single

operation (VLEN=1)

VADD VDST ← VSRC1 + vector add 5 cycles

VSRC2
VMUL VDST ← VSRC1 * vector mult. 15 cycles
VSRC2
VSHR VDST ← VSRC >> vector shift 1 cycles
1
VLD VDST ← vector load 20 cycles
mem[SRC]
VST VSRC → vector store 20 cycles
mem[DST]
• All the vector instructions operate with a vector length specified by VLEN. The VLD
instruction loads VLEN consecutive elements from the DST address specified by the value
in the VDST register. The VST instruction stores VLEN elements from the VSRC register
in consecutive addresses in memory, starting from the address specified in DST.

architectureassignmenthelp.com
• Both processors have eight vector registers (VR0 to VR7) which can contain up to 16
elements, and eight scalar registers (R0 to R7). The entire vector register needs to be ready
(i.e., populated with all VLEN elements) before any element of it can be used as part of
another operation.
• The memory can sustain a throughput of one element per cycle. The memory consists of
16 banks that can be accessed independently. A single memory access can be initiated in
each cycle. The memory can sustain 16 parallel accesses if they all go to different banks.

(a) Which processor (array or vector processor) is more costly in terms of chip area?
Explain.
Array processor

Explanation: An array processor requires 16 functional units for an operation whereas

a vector processor requires only 1.

(b) The following code takes 52 cycles to execute on the vector processor:
VADD VR2 ← VR1, VR0
VADD VR3 ← VR2, VR5
VMUL VR6 ← VR2, VR3

architectureassignmenthelp.com
What is the VLEN of the instructions? Explain your answer.

VLEN: 10

Explanation: 5+(VLEN-1)+5+(VLEN-1)+15+(VLEN-1) = 52 ⇒ VLEN = 10

How long would the same code execute on an array processor with the same vector
length?
25 cycles

Explanation: there are data dependencies among instructions ⇒ 5+5+15= 25 cycles

(c) The following code takes 94 cycles to execute on the vector processor:
VLD VR0 ← mem[ R0 ]
VLD VR1 ← mem[ R1 ]
VADD VR2 ← VR1, VR0
VSHR VR2 ← VR2
VST VR2 → mem[ R2 ]
architectureassignmenthelp.com
Assume that the elements loaded in VR0 are all placed in different banks, and that the
elements loaded into VR1 are placed in the same banks as the elements in VR0.
Similarly, the elements of VR2 are stored in different banks in memory. What is the
VLEN of the instructions? Explain your answer.

VLEN: 8

Explanation: 20+20+(VLEN-1)+5+(VLEN-1)+1+(VLEN-1)+20+(VLEN-1) = 94.

⇒ VLEN = 8

(d) We replace the memory with a new module whose characteristics are unknown. The
following code (the same as that in (c)) takes 163 cycles to execute on the vector
processor:
VLD VR0 ← mem[ R0 ]
VLD VR1 ← mem[ R1 ]
VADD VR2 ← VR1, VR0
VSHR VR2 ← VR2
VST VR2 → mem[ R2 ]

architectureassignmenthelp.com
The VLEN of the instructions is 16. The elements loaded in VR0 are placed in
consecutive banks, the elements loaded in VR1 are placed in consecutive banks, and the
elements of VR2 are also stored in consecutive banks. What is the number of banks of the
new memory module? Explain.

[Correction] The number of cycles should be 170 instead of 163. For grading this
question the instructor took into account only the student’s reasoning.

Number of banks: 8

Explanation: Assuming that the number of banks is power of two, 20*(16/banks)+

20*(16/banks)+ (banks-1)+ 5+ (VLEN-1)+ 1+ (VLEN-1)+ 20*(16/banks)+ (banks-1) =
170 ⇒ banks=8

4 In-DRAM Bitmap Indices

Recall that in class we discussed Ambit, which is a DRAM design that can greatly
accelerate Bulk Bitwise Operations by providing the ability to perform bitwise AND/OR
of two rows in a subarray.

architectureassignmenthelp.com
One real-world application that can benefit from Ambit’s in-DRAM bulk bitwise
operations is the database bitmap index, as we also discussed in the lecture. By using
bitmap indices, we want to run the following query on a database that keeps track of user
actions: “How many unique users were active every week for the past w weeks?" Every
week, each user is represented by a single bit. If the user was active a given week, the
corresponding bit is set to 1. The total number of users is u.

We assume the bits corresponding to one week are all in the same row. If u is greater
than the total number of bits in one row (the row size is 8 kilobytes), more rows in
different subarrays are used for the same week. We assume that all weeks corresponding
to the users in one subarray fit in that subarray. We would like to compare two possible
implementations of the database query:

• CPU-based implementation: This implementation reads the bits of all u users for the w
weeks. For each user, it ands the bits corresponding to the past w weeks. Then, it
performs a bit-count operation to compute the final result.

Since this operation is very memory-bound, we simplify the estimation of the execution
time as the time needed to read all bits for the u users in the last w weeks. The memory
bandwidth that the CPU can exploit is X bytes/s.

architectureassignmenthelp.com
• Ambit-based implementation: This implementation takes advantage of bulk and operations
of Ambit. In each subarray, we reserve one Accumulation row and one Operand row
(besides the control rows that are needed for the regular operation of Ambit). Initially, all
bits in the Accumulation row are set to 1. Any row can be moved to the Operand row by
using RowClone (recall that RowClone is a mechanism that enables very fast copying of a
row to another row in the same subarray). trc and tand are the latencies (in seconds) of
RowClone’s copy and Ambit’s and respectively. Since Ambit does not support bit-count
operations inside DRAM, the final bit-count is still executed on the CPU. We consider that
the execution time of the bit-count operation is negligible compared to the time needed to
read all bits from the Accumulation rows by the CPU.

(a) What is the total number of DRAM rows that are occupied by u users and w weeks?

T otalRows = [ u /8×8k] × w.
Explanation:
The u users are spread across a number of subarrays:
NumSubarrays = [u /8×8k].

Thus, the total number of rows is:

TotalRows = [u/8×8k] × w.

architectureassignmenthelp.com
(b) What is the throughput in users/second of the Ambit-based implementation?

T hrAmbit = u/[ u 8×8k]×w×(trc+tand)+ u /X×8 users/second.

Explanation:

First, let us calculate the total time for all bulk and operations. We should add
trc and tand for all rows:
tand−total = [u 8×8k] × w × (trc + tand) seconds.

Then, we calculate the time needed to compute the bit count on CPU:
tbitcount = u/8/x = u/Xx8 seconds.

Thus, the throughput in users/s is:

T hrAmbit = u/tand−total+tbitcount users/second.

architectureassignmenthelp.com
(c) What is the throughput in users/second of the CPU implementation?

T hrCP U = X×8 /w users/second.

Explanation:
We calculate the time needed to bring all users and weeks to the CPU:
tCP U = uxw/8/X = u×w/X×8 seconds
Thus, the throughput in users/s is:
T hrCP U = u /tCP U = X×8/w users/second.

(d) What is the maximum w for the CPU implementation to be faster than the Ambit-
based implementation? Assume u is a multiple of the row size.

architectureassignmenthelp.com
5 BONUS: Caching vs. Processing-in-Memory

We are given the following piece of code that makes accesses to integer arrays A and
B. The size of each element in both A and B is 4 bytes. The base address of array A is
0x00001000, and the base address of B is 0x00008000.

movi R1, #0x1000 // Store the base address of A in R1

movi R2, #0x8000 // Store the base address of B in R2
movi R3, #0

architectureassignmenthelp.com
Outer_Loop:
movi R4, #0
movi R7, #0
Inner_Loop:
add R5, R3, R4 // R5 = R3 + R4
// load 4 bytes from memory address R1+R5
ld R5, [R1, R5] // R5 = Memory[R1 + R5],
ld R6, [R2, R4] // R6 = Memory[R2 + R4]
mul R5, R5, R6 // R5 = R5 * R6
add R7, R7, R5 // R7 += R5
inc R4 // R4++
bne R4, #2, Inner_Loop // If R4 != 2, jump to Inner_Loop

//store the data of R7 in memory address R1+R3

st [R1, R3], R7 // Memory[R1 + R3] = R7,
inc R3 // R3++
bne R3, #16, Outer_Loop // If R3 != 16, jump to Outer_Loop

architectureassignmenthelp.com
You are running the above code on a single-core processor. For now, assume that the
processor does not have caches. Therefore, all load/store instructions access the main
memory, which has a fixed 50- cycle latency, for both read and write operations. Assume
that all load/store operations are serialized, i.e., the latency of multiple memory requests
cannot be overlapped. Also assume that the execution time of a non-memory-access
instruction is zero (i.e., we ignore its execution time).

(a) What is the execution time of the above piece of code in cycles?

4000 cycles.
Explanation: There are 5 memory accesses for each outer loop iteration. The outer
loop iterates 16 times, and each memory access takes 50 cycles. 16 ∗ 5 ∗ 50 = 4000
cycles

(b) Assume that a 128-byte private cache is added to the processor core in the next-
generation processor. The cache block size is 8-byte. The cache is direct-mapped. On a
hit, the cache services both read and write requests in 5 cycles. On a miss, the main
memory is accessed and the access fills an 8-byte cache line in 50 cycles. Assuming that
the cache is initially empty, what is the new execution time on this processor with the
described cache? Show your work.

architectureassignmenthelp.com
900 cycles.

Explanation. At the beginning A and B conflict in the first two cache lines. Then
the elements of A and B go to different cache lines. The total execution time is 1910
cycles. Here is the access pattern for the first outer loop iteration:
0 − A[0], B[0], A[1], B[1], A[0]

The first 4 references are loads, the last (A[0]) is a store. The cache is initially empty.
We have a cache miss for A[0]. A[0] and A[1] is fetched to 0th index in the cache.
Then, B[0] is a miss, and it is conflicting with A[0]. So, A[0] and A[1] are evicted.
Similarly, all cache blocks in the first iteration are conflicting with each other. Since
we have only cache misses, the latency for those 5 references is 5 ∗ 50 = 250 cycles
The status of the cache after making those seven references is:

Cache Index Cache Block

0 A(0,1), B(0,1), A(0,1), B(0,1), A(0,1)
Second iteration on the outer loop:
1 − A[1], B[0], A[2], B[1], A[1]

architectureassignmenthelp.com
Cache hits/misses in the order of the references:
H, M, M, H, M
Latency = 2 ∗ 5 + 3 ∗ 50 = 165 cycles
Cache Status:
- A(0,1) is in set 0
- A(2,3) is in set 1
-the rest of the cache is empty

-2 − A[2], B[0], A[3], B[1], A[2]

Cache hits/misses:
H, M, H, H, H
Latency : 4 ∗ 5 + 1 ∗ 50 = 70 cycles

Cache Status:
-B(0,1) is in set 0
-A(2,3) is in set 1
-the rest of the cache is empty

3 − A[3], B[0], A[4], B[1], A[3]

architectureassignmenthelp.com
Cache hits/misses:
H, H, M, H, H
Latency : 4 ∗ 5 + 1 ∗ 50 = 70 cycles

Cache Status:
-B(0,1) is in set 0
-A(2,3) is in set 1
-A(4,5) is in set 2
-the rest of the cache is empty

4 − A[4], B[0], A[5], B[1], A[4]

Cache hits/misses:
H, H, H, H, H
Latency : 5 ∗ 5 = 25 cycles

Cache Status:

-B(0,1) is in set 0
-B(2,3) is in set 1
-A(4,5) is in set 2
-the rest of the cache is empty

architectureassignmenthelp.com
After this point, single-miss and zero-miss (all hits) iterations are interleaved until the
16th iteration.

Overall Latency:
165 + 70 + (70 + 25) ∗ 7 = 900 cycles
(c) You are not satisfied with the performance after implementing the described cache.
To do better, you consider utilizing a processing unit that is available close to the main
memory. This processing unit can directly interface to the main memory with a 10-cycle
latency, for both read and write operations. How many cycles does it take to execute the
same program using the inmemory processing units? (Assume that the in-memory
processing unit does not have a cache, and the memory accesses are serialized like in
the processor core. The latency of the non-memory-access operations is ignored.)

800 cycles.

Explanation: Same as for the processor core without a cache, but the memory
access latency is 10 cycles instead of 50. 16 ∗ 5 ∗ 10 = 800

architectureassignmenthelp.com
(d) You friend now suggests that, by changing the cache capacity of the single-core
processor (in part (b)), she could provide as good performance as the system that utilizes
the memory processing unit (in part (c)).
Is she correct? What is the minimum capacity required for the cache of the single-core
processor to match the performance of the program running on the memory processing
unit?
No, she is not correct.

Explanation: Increasing the cache capacity does not help because doing so
cannot eliminate the conflicts to Set 0 in the first two iterations of the outer loop.

(e) What other changes could be made to the cache design to improve the
performance of the single-core processor on this program?

Increasing the associativity of the cache.

Explanation: Although there is enough cache capacity to exploit the locality of the
accesses, the fact that in the first two iterations the accessed data map to the same set
causes conflicts. To improve the hit rate and the performance, we can change the
addressto-set mapping policy. For example, we can change the cache design to be set-
associative or fully-associative.

architectureassignmenthelp.com

Numerical: Central Processing Unit
No ratings yet
Numerical: Central Processing Unit
28 pages
NVIDIA Questions
0% (1)
NVIDIA Questions
5 pages
Digital Design Interview Questions & Answers
No ratings yet
Digital Design Interview Questions & Answers
5 pages
Interviews Question For Physical Design: Digital Design Interview Questions
100% (1)
Interviews Question For Physical Design: Digital Design Interview Questions
44 pages
ES Final Jan10 PDF
No ratings yet
ES Final Jan10 PDF
6 pages
Section 1: Correct
100% (1)
Section 1: Correct
153 pages
Appendix A PDF
No ratings yet
Appendix A PDF
20 pages
Instruction Level Parallelism: Pipelining
No ratings yet
Instruction Level Parallelism: Pipelining
6 pages
M116C 1 EE116C-Midterm2-w15 Solution
100% (1)
M116C 1 EE116C-Midterm2-w15 Solution
8 pages
Assignment Nov 19
No ratings yet
Assignment Nov 19
7 pages
Tuesday, October 31, 2023 10:53 PM: Discuss, The Schemes For Dealing With The Pipeline Stalls Caused by Branch Hazards
No ratings yet
Tuesday, October 31, 2023 10:53 PM: Discuss, The Schemes For Dealing With The Pipeline Stalls Caused by Branch Hazards
7 pages
Assignment Questions
No ratings yet
Assignment Questions
3 pages
cs146 Fall2017 Midterm1xx
No ratings yet
cs146 Fall2017 Midterm1xx
12 pages
Synthesis Related Questions
No ratings yet
Synthesis Related Questions
19 pages
Homework 2
No ratings yet
Homework 2
8 pages
Comp Architecture Sample Questions
No ratings yet
Comp Architecture Sample Questions
9 pages
Linear Pipeline - Collosion Vector Analysis
No ratings yet
Linear Pipeline - Collosion Vector Analysis
23 pages
Kien-Truc-May-Tinh - David-Brooks - cs146-hw2 - (Cuuduongthancong - Com)
No ratings yet
Kien-Truc-May-Tinh - David-Brooks - cs146-hw2 - (Cuuduongthancong - Com)
5 pages
Explaining The Explain
No ratings yet
Explaining The Explain
41 pages
Mid Paper
No ratings yet
Mid Paper
29 pages
4-The Processors
No ratings yet
4-The Processors
3 pages
Midterm1 Soln Fall09 PDF
No ratings yet
Midterm1 Soln Fall09 PDF
6 pages
Assignment 2 Solution
0% (1)
Assignment 2 Solution
4 pages
CS60003 High Performance Computer Architecture
No ratings yet
CS60003 High Performance Computer Architecture
3 pages
Homework3 Solution v2
No ratings yet
Homework3 Solution v2
41 pages
Interrupts of 8051
No ratings yet
Interrupts of 8051
8 pages
Paraphrase
No ratings yet
Paraphrase
51 pages
CS-3010 (HPC) - CS Mid Sept 2023
No ratings yet
CS-3010 (HPC) - CS Mid Sept 2023
7 pages
07 Explain Explain
No ratings yet
07 Explain Explain
43 pages
Ca CT2
No ratings yet
Ca CT2
4 pages
Cse590490 HW2
No ratings yet
Cse590490 HW2
5 pages
Assignment: - 4: Part - A
No ratings yet
Assignment: - 4: Part - A
9 pages
HW Problems Chapter-1
No ratings yet
HW Problems Chapter-1
3 pages
Verification and Computer Architecture Important Links
No ratings yet
Verification and Computer Architecture Important Links
22 pages
CCNA Questions July 2019
No ratings yet
CCNA Questions July 2019
294 pages
CCNAquestions July 2019 PDF
No ratings yet
CCNAquestions July 2019 PDF
294 pages
DSP-8 (DSP Processors)
No ratings yet
DSP-8 (DSP Processors)
8 pages
Synthesis
No ratings yet
Synthesis
10 pages
RISC Instruction Set:: I) Data Manipulation Instructions
No ratings yet
RISC Instruction Set:: I) Data Manipulation Instructions
8 pages
Solution of Questions from Chapter 4-COAL.docx
No ratings yet
Solution of Questions from Chapter 4-COAL.docx
28 pages
PA2 - Lehra Do! Prefetchers
No ratings yet
PA2 - Lehra Do! Prefetchers
6 pages
HW2 Operating Systems
No ratings yet
HW2 Operating Systems
9 pages
Classic RISC Pipeline
No ratings yet
Classic RISC Pipeline
10 pages
LTE Int Rec
No ratings yet
LTE Int Rec
3 pages
Analog Design Assignment: Analog and Digital VLSI Design - EEE C443 February 1, 2012
No ratings yet
Analog Design Assignment: Analog and Digital VLSI Design - EEE C443 February 1, 2012
9 pages
Test 1 Answer Key
No ratings yet
Test 1 Answer Key
9 pages
SET-1: Answer To The Interview Questions
No ratings yet
SET-1: Answer To The Interview Questions
9 pages
CS6303 - CA - IQ - Nov - Dec 2017 PDF
No ratings yet
CS6303 - CA - IQ - Nov - Dec 2017 PDF
3 pages
Gate Questions Operating Systems (1)
No ratings yet
Gate Questions Operating Systems (1)
50 pages
UNIT-5: Pipeline and Vector Processing
No ratings yet
UNIT-5: Pipeline and Vector Processing
63 pages
CIS2400_Mod05_HW-Datapath-FSMs_w_timing
No ratings yet
CIS2400_Mod05_HW-Datapath-FSMs_w_timing
4 pages
Indian Institute of Technology, Kharagpur: Mid-Spring Semester 2021-22
No ratings yet
Indian Institute of Technology, Kharagpur: Mid-Spring Semester 2021-22
4 pages
Architecture
No ratings yet
Architecture
21 pages
disc08-sols
No ratings yet
disc08-sols
8 pages
CSC 424 Assignment
100% (1)
CSC 424 Assignment
8 pages
Assignment 1
No ratings yet
Assignment 1
4 pages
ICC 201012 LG 06 Finishing
No ratings yet
ICC 201012 LG 06 Finishing
14 pages
Ece305 PDF
No ratings yet
Ece305 PDF
24 pages
PipelineHazards
No ratings yet
PipelineHazards
4 pages
Hillstone All Series Device Troubleshooting and Debug Guide
No ratings yet
Hillstone All Series Device Troubleshooting and Debug Guide
128 pages
WAN TECHNOLOGY FRAME-RELAY: An Expert's Handbook of Navigating Frame Relay Networks
From Everand
WAN TECHNOLOGY FRAME-RELAY: An Expert's Handbook of Navigating Frame Relay Networks
Mamta Devi
No ratings yet
CCNA Exam Excellence: Study Guide & Practice Tests
From Everand
CCNA Exam Excellence: Study Guide & Practice Tests
SUJAN
No ratings yet
Mainframes Refresher
100% (1)
Mainframes Refresher
285 pages
030-100 Exam Dumps - Web Development Essentials
No ratings yet
030-100 Exam Dumps - Web Development Essentials
5 pages
Postman Presentation Theory
No ratings yet
Postman Presentation Theory
21 pages
FOCAS2/Ethernet For Linux Operator'S Manual
No ratings yet
FOCAS2/Ethernet For Linux Operator'S Manual
15 pages
C Programming Exercice
No ratings yet
C Programming Exercice
5 pages
Algorithms Data Structures GATE Computer Science Postal Study Material
No ratings yet
Algorithms Data Structures GATE Computer Science Postal Study Material
15 pages
Unit-3 Ooad
No ratings yet
Unit-3 Ooad
7 pages
Chapter 4 PPT
No ratings yet
Chapter 4 PPT
30 pages
Verilog SV Interview Questions
No ratings yet
Verilog SV Interview Questions
21 pages
3rd Sem Imp
No ratings yet
3rd Sem Imp
3 pages
Java Pass by Value - Javatpoint
No ratings yet
Java Pass by Value - Javatpoint
1 page
R Vs Python For Data Science
No ratings yet
R Vs Python For Data Science
7 pages
PSR-1 - Basic Coding Standard - PHP-FIG
No ratings yet
PSR-1 - Basic Coding Standard - PHP-FIG
6 pages
Oracle APPS Interview Question & Answers
100% (1)
Oracle APPS Interview Question & Answers
60 pages
Term Paper CSE-101: Submitted To Submitted by
No ratings yet
Term Paper CSE-101: Submitted To Submitted by
14 pages
FileNet Tuning Parameters
No ratings yet
FileNet Tuning Parameters
24 pages
TTT
No ratings yet
TTT
22 pages
Nguyen Thanh Nam
No ratings yet
Nguyen Thanh Nam
21 pages
Chapter05 ERD Edited#1
No ratings yet
Chapter05 ERD Edited#1
54 pages
Dsa Shhet by Arsh Goyal
No ratings yet
Dsa Shhet by Arsh Goyal
158 pages
TB_Unit4ProgressCheckFRQ_65b8681ad49bf8.65b8681bb763a6.55461556
No ratings yet
TB_Unit4ProgressCheckFRQ_65b8681ad49bf8.65b8681bb763a6.55461556
5 pages
EJ Unit 4
No ratings yet
EJ Unit 4
39 pages
Aman - DS
0% (1)
Aman - DS
59 pages
Data Structure Lab Manual
No ratings yet
Data Structure Lab Manual
77 pages
Understanding LabVIEW Programming Patterns and Frameworks
No ratings yet
Understanding LabVIEW Programming Patterns and Frameworks
43 pages
Procmail Recipes
No ratings yet
Procmail Recipes
184 pages
Chapter 8 Optimization 1
No ratings yet
Chapter 8 Optimization 1
5 pages
ADAM-5560 UserManual V1.2 (EN)
No ratings yet
ADAM-5560 UserManual V1.2 (EN)
194 pages

Online Architecture Assignment Help

Uploaded by

Online Architecture Assignment Help

Uploaded by

Online Architecture Assignment Help

For any Assignment related queries, call us at : - +1 678 648 4277

1.1 Basics and Assumptions

Assume also the following:

Scheduling Application A Application B Application C

FCFS 195 ns 285 ns 135 ns

Channel 0, hits: FCFS: 3, FR-FCFS: 5

Channel 0, misses: FCFS: 5, FR-FCFS: 3

Channel 1, hits: FCFS: 2, FR-FCFS: 4

Channel 1, misses: FCFS: 6, FR-FCFS: 4

A processor implements an in-order pipeline with 12 stages. Each stage completes in a

At the end of the 7th stage.

Explanation: Total cycles = 12 + 1000 + 200 ∗ X − 1

i) Static Branch Predictor

i. Static Prediction Direction

Explanation: Such a predictor makes 6 mispredictions, which is the number

i. Initial Prediction Direction

iii) Backward taken, Forward not taken (BTFN)

Could this be the mysterious branch predictor?

iv) Two-bit Counter Based Prediction (using saturating arithmetic)

i. Initial Prediction Direction

Explanation: Such a configuration yields 6 mispredictions, which results in 115

Operation Description Name Latency of a single

VADD VDST ← VSRC1 + vector add 5 cycles

Explanation: An array processor requires 16 functional units for an operation whereas

Explanation: 5+(VLEN-1)+5+(VLEN-1)+15+(VLEN-1) = 52 ⇒ VLEN = 10

Explanation: there are data dependencies among instructions ⇒ 5+5+15= 25 cycles

Explanation: 20+20+(VLEN-1)+5+(VLEN-1)+1+(VLEN-1)+20+(VLEN-1) = 94.

Explanation: Assuming that the number of banks is power of two, 20*(16/banks)+

4 In-DRAM Bitmap Indices

Thus, the total number of rows is:

T hrAmbit = u/[ u 8×8k]×w×(trc+tand)+ u /X×8 users/second.

Thus, the throughput in users/s is:

T hrAmbit = u/tand−total+tbitcount users/second.

T hrCP U = X×8 /w users/second.

movi R1, #0x1000 // Store the base address of A in R1

//store the data of R7 in memory address R1+R3

Cache Index Cache Block

-2 − A[2], B[0], A[3], B[1], A[2]

3 − A[3], B[0], A[4], B[1], A[3]

4 − A[4], B[0], A[5], B[1], A[4]

Increasing the associativity of the cache.

You might also like