BITS Pilani CS F342 Exam Dec 2024 Details


Birla Institute of Technology & Science – Pilani

Hyderabad Campus
1st Semester 2024-2025
Computer Architecture (CS F342) – Comprehensive Examination (Regular)
Date: 07.12.2024 Weightage: 40% Duration: 3 hours Type: Closed Book
Instructions: Answer all questions. All parts of a question should be answered consecutively. No. of pages: 4
Q1. (a) This problem is to find out the impact of two different cache organizations - one cache is direct mapped and the
other is two-way set associative - on the performance of a processor. Assume that the CPI with a perfect cache is 1.6, the
clock cycle time is 0.35 ns, there are 1.4 memory references per instruction, the size of both caches is 128 KB, and both
have a block size of 64 bytes. The cache miss penalty is 65 ns for either cache organization. Assume the hit time is 1 clock
cycle, the miss rate of the direct mapped 128 KB cache is 2.1%, and the miss rate for a two-way set associative cache of
the same size is 1.9%. Assume that the processor clock cycle time must be stretched 1.35 times to accommodate the
selection multiplexer of the set associative cache.
(i) Compute the Average Memory Access Time (AMAT) for both cache organizations. Which cache organization is
better?
(ii) Now apply the CPU performance equation (with memory latency) to find out the relative performance of these two
cache organizations.
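
The quantities in part (a) can be related with the usual textbook formulas, AMAT = hit time + miss rate × miss penalty, and CPU time per instruction = CPI × cycle time + memory references × miss rate × miss penalty. The following is only a sketch under those assumptions: the names are illustrative, and it assumes the 65 ns miss penalty is not affected by the stretched clock.

```python
# Illustrative sketch for Q1(a); names are hypothetical, values come from the question.
CPI_PERFECT = 1.6
CYCLE_NS = 0.35
REFS_PER_INSTR = 1.4
MISS_PENALTY_NS = 65.0
STRETCH = 1.35  # clock stretch needed by the 2-way cache multiplexer


def amat_ns(hit_time_ns, miss_rate):
    """Average memory access time: hit time plus expected miss cost."""
    return hit_time_ns + miss_rate * MISS_PENALTY_NS


def cpu_time_per_instr_ns(cycle_ns, miss_rate):
    """Execution time per instruction including memory stall time."""
    return CPI_PERFECT * cycle_ns + REFS_PER_INSTR * miss_rate * MISS_PENALTY_NS


amat_dm = amat_ns(CYCLE_NS, 0.021)
amat_2way = amat_ns(CYCLE_NS * STRETCH, 0.019)
time_dm = cpu_time_per_instr_ns(CYCLE_NS, 0.021)
time_2way = cpu_time_per_instr_ns(CYCLE_NS * STRETCH, 0.019)

print(f"AMAT: direct mapped {amat_dm:.4f} ns, 2-way {amat_2way:.4f} ns")
print(f"CPU time/instr: direct mapped {time_dm:.3f} ns, 2-way {time_2way:.3f} ns")
print(f"Relative performance (2-way time / direct-mapped time): {time_2way / time_dm:.4f}")
```

Under these assumptions the two metrics can rank the caches differently, because the stretched clock slows every instruction and not only the memory accesses.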
(b) Given a 2-way set associative mapped cache of size 16 KB with block size 256 bytes. The size of main memory is 128
KB. (i) Find the number of bits used to represent the tag field (ii) Find the tag directory size in bytes.
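
One way to sketch the address breakdown for part (b) is shown here. It assumes byte addressing, power-of-two sizes, and a tag directory that stores only the tag bits (no valid or dirty bits); the function name is hypothetical.

```python
def cache_tag_layout(cache_bytes, block_bytes, ways, memory_bytes):
    """Return (tag_bits, tag_directory_bytes) for a set-associative cache.

    Assumes all sizes are powers of two and the directory holds tag bits only.
    """
    address_bits = memory_bytes.bit_length() - 1   # log2 of memory size
    offset_bits = block_bytes.bit_length() - 1     # log2 of block size
    lines = cache_bytes // block_bytes
    sets = lines // ways
    index_bits = sets.bit_length() - 1             # log2 of number of sets
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, (lines * tag_bits) // 8


tag_bits, directory_bytes = cache_tag_layout(16 * 1024, 256, 2, 128 * 1024)
print(tag_bits, directory_bytes)
```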
(4 + 4 = 8 marks)
Q2. (a) Consider the Booth’s algorithm for signed multiplication; let A be the accumulator register, Q be the multiplier
register (5 bits) and M be the multiplicand register (5 bits). Also let Q-1 be a single bit special register. An initial set of
values for implementation of the Booth’s algorithm is given in the table below. Complete the entries in the table. Also,
what are the values of Q, M and A in decimal? Use the notation ASHR to represent arithmetic shift right operation. Please
note that all values filled in each row should be correct to get full marks. Reproduce the table while answering.
A       Q       Q-1   M       Operation
00000   10011   0     10111   Initial stage
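
The register updates asked for in the table can be traced with a short simulation. This is a minimal sketch of Booth's algorithm on n-bit two's-complement patterns; the function names and the trace format are illustrative choices, not part of the question.

```python
def booth_multiply(q, m, n):
    """Run Booth's algorithm on n-bit patterns q (multiplier) and m (multiplicand).

    Returns the final (A, Q) register contents and a trace of every operation.
    """
    mask = (1 << n) - 1
    a, q_1 = 0, 0
    trace = []
    for _ in range(n):
        pair = (q & 1, q_1)
        if pair == (1, 0):
            a = (a - m) & mask
            trace.append(("A = A - M", a, q, q_1))
        elif pair == (0, 1):
            a = (a + m) & mask
            trace.append(("A = A + M", a, q, q_1))
        # ASHR across the combined A : Q : Q-1 register, keeping A's sign bit.
        q_1 = q & 1
        q = ((q >> 1) | ((a & 1) << (n - 1))) & mask
        a = (a >> 1) | (a & (1 << (n - 1)))
        trace.append(("ASHR", a, q, q_1))
    return a, q, trace


def to_signed(value, bits):
    """Interpret a bit pattern as a two's-complement signed integer."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


a, q, trace = booth_multiply(0b10011, 0b10111, 5)
product = to_signed((a << 5) | q, 10)
for op, ra, rq, rq1 in trace:
    print(f"{ra:05b} {rq:05b} {rq1} {op}")
print("product =", product)
```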

(b) An n-bit positive divisor is loaded into the register M and an n-bit positive dividend is loaded into the register Q at the
start of the division operation. Register A is set to 0. After the division is complete, the n-bit quotient is in register Q and
the remainder is in register A. (i) What are the inputs (in decimal) and outputs (in decimal) in this restoring division?
(ii) For the given restoring division scenario, label all the steps across the four cycles to the left of each step.

(4 + 4 = 8 marks)
Q3. (a) (i) Assume the distribution of various types of instructions that run on a processor is as follows: 50%: ALU; 25%:
BEQ; 15%: LW and 10%: SW. If there are no stalls or hazards, what is the utilization of the data memory? What is the
utilization of the register block's write port? (find utilization in percentage of clock cycles used). (ii) You are given a non-
pipelined processor design which has a cycle time of 10 ns and average CPI of 1.4. Assume pipelining changes the CPI to
1. Calculate the latency speedup in the following scenarios. Use the following tabular format for answering.

Scenario                                                                          Speedup
Pipelining into 5 stages with new latency 2 ns
Pipelining into 5 stages with 1 ns, 1.5 ns, 4 ns, 3 ns, and 0.5 ns latency
After adding the pipeline register delay of 20 ps to the above, i.e., 5 stages
with 1 ns, 1.5 ns, 4 ns, 3 ns, and 0.5 ns latency
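
A possible way to compute the table entries is sketched below. It assumes that "speedup" compares the average time per instruction (CPI × cycle time) and that the pipelined clock is set by the slowest stage plus any register delay; the function name is hypothetical.

```python
BASE_TIME_NS = 10.0 * 1.4  # non-pipelined: cycle time x average CPI


def pipeline_speedup(stage_latencies_ns, register_delay_ns=0.0, pipelined_cpi=1.0):
    """Speedup over the non-pipelined design, clocked by the slowest stage."""
    cycle_ns = max(stage_latencies_ns) + register_delay_ns
    return BASE_TIME_NS / (cycle_ns * pipelined_cpi)


uneven = [1.0, 1.5, 4.0, 3.0, 0.5]
balanced = pipeline_speedup([2.0] * 5)
unbalanced = pipeline_speedup(uneven)
with_registers = pipeline_speedup(uneven, register_delay_ns=0.02)  # 20 ps

print(balanced, unbalanced, round(with_registers, 3))
```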

(b) (i) Consider a hypothetical ISA, where an instruction requires four stages (IF: 30 ns; ID: 9 ns; EX: 20 ns; WB: 10 ns).
This instruction must proceed through the stages in sequence. What is the minimum asynchronous time for
any single instruction to complete? (ii) Now we want to implement the above instruction in part b (i) as a pipelined
instruction. How many stages should we have and at what rate should we clock the pipeline?
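
The timing relationships in part (b) can be sketched as follows, assuming the four stages are kept as given (no stage is split further) and that the pipeline clock must accommodate the slowest stage. The variable names are illustrative.

```python
STAGES_NS = {"IF": 30, "ID": 9, "EX": 20, "WB": 10}

# Without a clock, one instruction needs the sum of its stage delays.
unpipelined_latency_ns = sum(STAGES_NS.values())

# With a clock, every stage gets the period of the slowest stage.
clock_period_ns = max(STAGES_NS.values())
clock_rate_mhz = 1000 / clock_period_ns
pipelined_latency_ns = len(STAGES_NS) * clock_period_ns

print(unpipelined_latency_ns, clock_period_ns, round(clock_rate_mhz, 2), pipelined_latency_ns)
```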
(4 + 4 = 8 marks)

Q4. (a) Represent the number 85.125₁₀ in IEEE 754 single precision format; all the steps in the computation need to be
shown clearly.
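
A hand-derived encoding for part (a) can be cross-checked by letting Python pack the value as a 32-bit float and splitting the word into its fields. This is only a checking aid (the helper name is hypothetical); it does not replace the step-by-step working the question asks for.

```python
import struct


def float32_fields(x):
    """Return (sign, biased_exponent, fraction, word) of x as IEEE 754 single precision."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    sign = word >> 31
    exponent = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF
    return sign, exponent, fraction, word


sign, exponent, fraction, word = float32_fields(85.125)
print(f"{sign} {exponent:08b} {fraction:023b} -> 0x{word:08X}")
```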
(b) Consider a hypothetical 6-bit floating point format (for positive numbers only). The 3 leading bits are used for the exponent
(E, in 2’s complement format) and the 3 remaining bits for the mantissa (M). The value is evaluated as 1.M × 2^E. Answer the
following questions as per this floating point format.

(i) What is the decimal value of 111101₂? (ii) What is the decimal value of 011011₂?
(iii) Find the range of this format in base 10; i.e. ? ≤ N ≤ ?. (iv) What is 2.5₁₀ in binary as per this format?

(v) What is 0.39₁₀ in binary as per this format? (Use simple truncation for bits exceeding the format field.)
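
The 6-bit format in part (b) is small enough to model directly. The sketch below assumes the layout described in the question (3-bit two's-complement exponent followed by a 3-bit mantissa, value = 1.M × 2^E) and truncates extra mantissa bits; `decode6` and `encode6` are hypothetical helper names.

```python
import math


def decode6(bits):
    """Decode a 6-character bit string such as '111101' into its value."""
    e = int(bits[:3], 2)
    if e >= 4:          # two's-complement exponent, range -4..3
        e -= 8
    m = int(bits[3:], 2)
    return (1 + m / 8) * 2.0 ** e


def encode6(value):
    """Encode a positive value, truncating mantissa bits that do not fit."""
    f, exp = math.frexp(value)      # value = f * 2**exp with 0.5 <= f < 1
    e = exp - 1                     # rewrite as (2f) * 2**e with 1 <= 2f < 2
    if not -4 <= e <= 3:
        raise ValueError("value outside the representable range")
    m = int((2 * f - 1) * 8)        # keep 3 fractional bits, drop the rest
    return format(e & 0b111, "03b") + format(m, "03b")


print(decode6("111101"), decode6("011011"))
print(decode6("100000"), decode6("011111"))   # smallest and largest values
print(encode6(2.5), encode6(0.39))
```

Note that truncation means the encoding of 0.39 decodes back to a slightly smaller value.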
(3 + 5 = 8 marks)

Q5. (a)(i) Conditional control instructions sometimes incur stalls in execution. Given 15 out of 100 instructions are
branches; ideal CPI is 1; we have a Branch Target Buffer (BTB) with 10% miss rate incurring 3 clock cycle miss penalty;
for the rest, the prediction accuracy is 92% which incurs a 7 cycle penalty in the event of a BHT (BPB) misprediction.
What is the resulting CPI once the branch stalls are included? (ii) This problem pertains to comparing two types of dynamic branch
predictors: one, a 2-bit predictor with 92% accuracy and the other a 2-level correlating predictor with 4% more accuracy.
As is reasonable for a modern high-frequency processor, assume that the branches resolve in stage 10 and that 20% of the
instructions are branches. Also, as you are aware, the correctly predicted branches will have 0-cycle penalty safely ensuring
ideal CPI. Does the 2-level correlating predictor offer a speedup over the 2-bit predictor? If so, by how much?
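
For part (a)(i), one common reading is that a branch either misses in the BTB (3-cycle penalty) or hits in the BTB and is then mispredicted with probability 1 − accuracy (7-cycle penalty). The sketch below encodes that reading only; the function name is hypothetical and other interpretations of the question would change the formula.

```python
def cpi_with_branch_stalls(branch_fraction, btb_miss_rate, btb_miss_penalty,
                           prediction_accuracy, mispredict_penalty, ideal_cpi=1.0):
    """CPI = ideal CPI + branch frequency x expected stall cycles per branch."""
    stall_per_branch = (
        btb_miss_rate * btb_miss_penalty
        + (1 - btb_miss_rate) * (1 - prediction_accuracy) * mispredict_penalty
    )
    return ideal_cpi + branch_fraction * stall_per_branch


cpi = cpi_with_branch_stalls(0.15, 0.10, 3, 0.92, 7)
print(round(cpi, 4))
```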

(b) (i) Data hazards form 85% of all hazards in modern pipelined execution. For the given code segments below, categorize
the data hazards as WAR, RAW or WAW. What is the total count of such hazards here? Answer in tabular format only.

Code segment Type of Data Hazard


MUL R6, R1, R2
ADD R7, R5, R6
MUL R6, R1, R2
ADD R6, R4, R5
DIV R3, R1, R2
MUL R5, R4, R3
ADD R4, R0, R6

(ii) Given that 40% of ALU instructions are followed by a dependent ALU instruction with separation 2 and that 90% of
instructions are ALU instructions. What is the worst-case scenario CPI?
(4 + 4 = 8 marks)
Q6. (a) (i) The MIPS assembler, by design, uses pseudo instructions, and every pseudo instruction is replaced with a set of
actual MIPS instructions at the hardware level to accomplish a task. Which pseudo instruction instr is replaced with the
following code in MIPS, and what does the pseudo instruction do? (Hint: instr Rd, Rs is the format where Rd = $s0 and
Rs = $t8)
addu $s0, $t8, $0
bgez $t8, positive
sub $s0, $0, $t8
positive:

(ii) Convert the following MIPS assembly code to its equivalent single pseudo code arithmetic expression. Please note that
31415 is Pi value scaled by 10,000. For arriving at the arithmetic expression, use the same registers as given in the MIPS
code.

li $t0, 31415
mult $t8, $t8
mflo $t1
mult $t1, $t0
mflo $s0
sra $s0, $s0, 1

(b) (i) Some architectures like MARS use the addui instruction to load a register with an immediate value, whereas we can
just as well use an ori instruction to replace the li instruction. Which of the two (addui, ori) is faster, and why? If R1 is the
destination to be loaded with an immediate value #4, write both the addui and ori instructions.
(ii) Give the output (in hexa) for various shift operations as applied to MIPS. Reproduce the table to answer.

Operation                         Input        Output
Shift left logical 2 spaces       0x00000004
Shift right logical 3 spaces      0x0000001f
Shift right arithmetic 3 spaces   0x0000001f
Shift right arithmetic 2 spaces   0xffffffe1
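
The shift results can be checked with a small model of 32-bit registers. Python integers are unbounded, so the sketch masks to 32 bits explicitly and sign-extends before an arithmetic shift; the helper names mirror the MIPS mnemonics but are otherwise illustrative.

```python
MASK32 = 0xFFFFFFFF


def sll(x, n):
    """Shift left logical: zeros enter from the right, result kept to 32 bits."""
    return (x << n) & MASK32


def srl(x, n):
    """Shift right logical: zeros enter from the left."""
    return (x & MASK32) >> n


def sra(x, n):
    """Shift right arithmetic: copies of the sign bit enter from the left."""
    x &= MASK32
    if x & 0x80000000:
        x -= 1 << 32          # reinterpret as a negative two's-complement value
    return (x >> n) & MASK32


for name, fn, value, amount in [
    ("sll", sll, 0x00000004, 2),
    ("srl", srl, 0x0000001F, 3),
    ("sra", sra, 0x0000001F, 3),
    ("sra", sra, 0xFFFFFFE1, 2),
]:
    print(f"{name} 0x{value:08x}, {amount} -> 0x{fn(value, amount):08x}")
```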

(4 + 4 = 8 marks)
Q7. (a) (i) We have a single level paging scheme with a TLB implementation. Assume that the number of page faults is 0. The
search time to TLB is 20 ns and the time required to access the main memory is 100 ns. The TLB hit ratio is 80%.
Compute the effective memory access time. (ii) Now we have a two level paging scheme with the TLB implementation.
Assume that there are no page faults. It takes 20 ns to search the TLB and 100 ns to access the main memory. The TLB hit
ratio is 80%. What is the effective memory access time?
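
A minimal sketch of the effective access time for part (a), assuming the TLB is always searched first, a TLB hit costs one memory access, and a TLB miss costs one memory access per page-table level plus the final data access. The function name is hypothetical.

```python
def effective_access_time_ns(tlb_ns, mem_ns, hit_ratio, levels):
    """Effective memory access time with a TLB and `levels` page-table levels."""
    hit_time = tlb_ns + mem_ns
    miss_time = tlb_ns + (levels + 1) * mem_ns
    return hit_ratio * hit_time + (1 - hit_ratio) * miss_time


single_level = effective_access_time_ns(20, 100, 0.8, levels=1)
two_level = effective_access_time_ns(20, 100, 0.8, levels=2)
print(round(single_level, 1), round(two_level, 1))
```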
(b)(i) We have a 32-bit machine where the virtual address is divided into the following 4 parts and the system uses a
3-level page table implementation. The first 10 bits are used for the first page table and so on. What is the page size in
bytes in this system?

10 bits 8 bits 6 bits 8 bits

(ii) A computer system has a 36-bit virtual address space with a page size of 8K and 4 bytes per page table entry. Find out
the total count of addressable pages in the virtual address space.
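
Both parts of (b) reduce to powers of two. The sketch assumes that the first three fields of the 32-bit address index the three page-table levels and the last field is the page offset, and that "8K" means 8 × 1024 bytes.

```python
# Part (i): the offset field is whatever remains after the three table indices.
ADDRESS_BITS = 32
TABLE_INDEX_BITS = (10, 8, 6)
offset_bits = ADDRESS_BITS - sum(TABLE_INDEX_BITS)
page_size_bytes = 1 << offset_bits

# Part (ii): pages = virtual address space size / page size.
virtual_pages = (1 << 36) // (8 * 1024)

print(offset_bits, page_size_bytes, virtual_pages)
```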
(4 + 4 = 8 marks)

Q8. (a) (i) Quantifying the availability is important to understand the dependability property of modern computing systems.
A computing system comprises 1000 disks with MTTF = 100,000 hr and MTTR = 100 hr. What is the availability of this
system? (ii) We can quantify the reliability of a computing system by the metric MTTF. Let us say, we have 20 disks with
10^6 hr MTTF per disk. There are also 2 disk controllers with 10^6 hr MTTF and a CPU fan with 50000 hr MTTF. What is
the MTTF of the system as a whole? What is the FIT?
(b) (i) Compute the MTTF for a high performance computing system with 1000 processors which was designed to ensure
availability, scalability and better throughput; however, it turned out that if one of the processors fails, they all fail. It is
given that the FIT for a single processor is 200. (ii) A company has 10,000 computers (as part of its server farm), each with
a MTTF of 40 days. Suppose it experiences a catastrophic failure only if 1/4 of the computers fail, what is the MTTF for
the system?
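
For parts (a)(ii) and (b)(i), a common modelling assumption is that component failures are independent with constant failure rates, so the rates of components in series simply add. The sketch below uses that assumption, takes the disk and controller MTTF as 10^6 hours, and uses FIT = failures per 10^9 hours; all names are illustrative.

```python
HOURS_PER_FIT_UNIT = 1e9  # FIT counts failures per billion hours


def series_failure_rate(components):
    """Sum of failure rates for (count, mttf_hours) pairs that must all work."""
    return sum(count / mttf_hours for count, mttf_hours in components)


# (a)(ii): 20 disks, 2 disk controllers, 1 CPU fan.
rate = series_failure_rate([(20, 1e6), (2, 1e6), (1, 50_000)])
system_mttf_hr = 1 / rate
system_fit = rate * HOURS_PER_FIT_UNIT

# (b)(i): 1000 processors, each with FIT = 200, where any failure stops the system.
cluster_fit = 1000 * 200
cluster_mttf_hr = HOURS_PER_FIT_UNIT / cluster_fit

print(round(system_mttf_hr, 1), round(system_fit), cluster_mttf_hr)
```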
(4 + 4 = 8 marks)
Q9. (a) Let us consider a computer executing the following mix of instructions:

Instruction   Frequency (%)   Clock cycles
ALU           50              1
LOAD          20              5
STORE         10              3
BRANCH        20              2

(i) What is the average CPI, assuming a clock period of 5 ns?


(ii) What is the speedup if an optimized data cache is introduced so that load instructions require 2 clock cycles?
(iii) What is the speedup if an optimized branch unit is introduced so that branch instructions require 1 clock cycle?
(iv) What is the speedup if we introduce 2 ALUs working in parallel?
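
Since the clock period is the same in every case, the speedups in part (a) reduce to ratios of average CPI. The sketch below treats the frequencies as percentages and, for (iv), assumes that two parallel ALUs halve the effective cycles of ALU instructions; that reading is an assumption, as are the names.

```python
MIX = {  # instruction class: (fraction of instructions, clock cycles)
    "ALU": (0.50, 1),
    "LOAD": (0.20, 5),
    "STORE": (0.10, 3),
    "BRANCH": (0.20, 2),
}


def average_cpi(mix):
    """Frequency-weighted average of cycles per instruction."""
    return sum(fraction * cycles for fraction, cycles in mix.values())


def speedup_with(instruction, new_cycles):
    """Speedup when one instruction class changes its cycle count."""
    changed = dict(MIX)
    changed[instruction] = (MIX[instruction][0], new_cycles)
    return average_cpi(MIX) / average_cpi(changed)


base_cpi = average_cpi(MIX)
load_speedup = speedup_with("LOAD", 2)
branch_speedup = speedup_with("BRANCH", 1)
alu_speedup = speedup_with("ALU", 0.5)  # assumption: 2 ALUs halve ALU cycles

print(round(base_cpi, 2), round(load_speedup, 3), round(branch_speedup, 3), round(alu_speedup, 3))
```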

(b) (i) Consider a 5-stage MIPS pipelined implementation scenario given below - but with no forwarding, and each stage
takes 1 cycle.
Sequence No.   Instruction
1              lw $s2, 0($s1)
2              lw $s1, 40($s3)
3              sub $s3, $s1, $s2
4              add $s3, $s2, $s2
5              or $s4, $s3, $zero
6              sw $s3, 50($s1)
It is decided to let the processor stall on hazards. How many times does the processor stall? How long is each stall (in
cycles)? What is the execution time (in cycles) for the whole program? (ii) For the same code segment above, now with full
forwarding, show the pipeline diagram with nops to eliminate the hazards.
(4 + 4 = 8 marks)
Q10. (a) (i) Software correction of the data flow with NOP (No Operation) is not truly an optimization mechanism, but
we occasionally need to use it when hardware mechanisms are found to be insufficient. Consider the following
code segment. Identify all the data hazards. Now solve all the data hazards using only NOP. Answer in tabular format only.
Original code with data hazards     Modified code after resolving all data hazards using only NOP
next: LW R1,0(R3)
      MUL R1,R1,R1
      SW R1,0(R3)
      SUBI R3, R3, #4
      BNE R0,R3, next
(ii) Control hazards and branch misprediction adversely impact the execution time of programs. Given the branch
misprediction rate as 50% and that 1 in 5 instructions is a branch, what is the number of useful instructions that we can
fetch? Using this data, prove that halving the miss rate would double the number of useful instructions that we can try to
extract ILP from.
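
One simple model for part (a)(ii), offered as an assumption and not as the expected proof: if a fraction f of instructions are branches and each is mispredicted with probability p, the expected number of instructions fetched up to and including the first mispredicted branch is 1 / (f × p). The function name is hypothetical.

```python
def useful_instructions(branch_fraction, mispredict_rate):
    """Expected instructions fetched before a misprediction redirects the fetch."""
    return 1 / (branch_fraction * mispredict_rate)


baseline = useful_instructions(1 / 5, 0.50)
halved = useful_instructions(1 / 5, 0.25)
print(baseline, halved, halved / baseline)
```

In this model the count is inversely proportional to the misprediction rate, which is why halving the rate doubles the window.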
(b) (i) This problem pertains to floating point numbers and round off errors. Let us say a computation is done using seven
decimal digits of precision and the number x = 1/3 would be approximated as xₐ = 3.333333 × 10^-1 = 0.3333333. What is
the absolute error between the number x and xₐ? And what is the relative error?
(ii) Perform the floating point addition of the two decimal numbers 99.99 and 0.161 such that the result will have 4
significant digits. You need to follow the complete steps like normalizing the result, rounding the result etc, if required.
(4 + 4 = 8 marks)
