cs433 Fa19 hw4 Solution
Homework 4
Total Points: Undergraduates (32 points), Graduates (50 points)
Undergraduate students should only do Problem 1. Graduate students should solve all
problems.
Due Date: October 15, 2019 at 11:00 am (See course information handout for more details)
Problem 1 [32 points]
The following code implements the DAXPY operation, 𝑌 = 𝑎𝑋 + 𝑌, for a vector of length 100.
Initially, R1 is set to the base address of array X and R2 is set to the base address of Y. Assume
initial value of R3 = 0. The DADDUI instruction before the loop is initialization code and should not be
included in the answer to any of the questions.
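As a reference for what the loop computes, here is a minimal Python sketch of DAXPY. The function name daxpy and the use of plain Python lists are illustrative assumptions; the sketch models only the arithmetic, not the MIPS pipeline behavior that the questions below ask about.

```python
def daxpy(a, x, y):
    """Update y in place so that y[i] = a * x[i] + y[i] for every element."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):
        # One loop iteration: load x[i] and y[i], multiply, add, store y[i].
        y[i] = a * x[i] + y[i]
    return y


if __name__ == "__main__":
    x = [float(i) for i in range(100)]  # vector length 100, as in the problem
    y = [1.0] * 100
    daxpy(2.0, x, y)
    print(y[0], y[99])
```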
Part A [6 points]
Consider the role of the compiler in scheduling the code. Rewrite this loop, but let every row take
a cycle (each row can be an instruction or a stall). If an instruction can’t be issued in a given cycle
(because the current instruction has a dependency that will not be resolved in time), write STALL
instead, and move on to the next cycle to see if it can be issued then. Assume that a NOP is
scheduled in the branch delay slot (effectively stalling 1 cycle after the branch). Explain all stalls,
but don’t reorder instructions. How many cycles elapse before the second iteration begins? Show
your work.
Solution:
Grading: 1 point for each of stalls (1), (6), and (7). No credit without any explanation for the stall. Partial
credit of ½ point if more than one stall cycle is indicated for the corresponding instruction.
2 points total for stalls (2) to (5). Partial credit is awarded as follows, assuming there is at least one correct
explanation for the stall (e.g., at least one of the F4 or F6 dependence is listed for stall 2): 1 point if a stall
is listed between these instructions, but the number of stalls is incorrect; ½ point if at least some of the
reasons for the stalls are correct.
Part B [6 points]
Now reschedule the loop. You can change immediate values and memory offsets. You can reorder
instructions, but don’t change anything else. Show any stalls that remain. How many cycles elapse
before the second iteration begins? Show your work.
Solution:
Grading: Full points for any correct sequence with the minimum number of stalls. Partial credit only if the
sequence does the same computation and reduces some stalls. Deduct ½ point for each error (e.g., incorrect
index), and deduct ½ point for each stall in excess of 2.
Part C [6 points]
Now unroll and reschedule the loop the minimum number of times needed to eliminate all stalls.
You can remove redundant instructions. How many times did you unroll the loop? How many
cycles elapse before the next iteration of the loop begins? Don’t worry about clean-up code. Show
your work.
Solution:
Grading: 1 point for the correct iteration count. Deduct ½ point for every error or stall cycle. Give partial
credit (2 points) if three iterations are used instead of two and the solution is correct with three iterations.
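The effect of unrolling twice can be sketched at the source level in Python. This is only an illustration of the transformation: the name daxpy_unrolled2 is hypothetical, and the sketch says nothing about the instruction order or stall counts of the graded MIPS schedule. Because clean-up code is out of scope, the sketch simply rejects odd lengths.

```python
def daxpy_unrolled2(a, x, y):
    """DAXPY with the loop body unrolled twice (two elements per trip)."""
    n = len(x)
    if n != len(y):
        raise ValueError("x and y must have the same length")
    if n % 2 != 0:
        # No clean-up code in this sketch, so the length must be even.
        raise ValueError("length must be a multiple of the unroll factor")
    i = 0
    while i < n:
        # Two copies of the original body; the index update and the loop
        # test are now paid once per two elements instead of once per element.
        y[i] = a * x[i] + y[i]
        y[i + 1] = a * x[i + 1] + y[i + 1]
        i += 2
    return y
```

With a vector length of 100, the unrolled loop runs 50 trips and needs no clean-up iteration.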
Part D [8 points]
Consider a VLIW processor in which one instruction can support two memory operations (load or
store), one integer operation (addition, subtraction, comparison, or branch), one floating point add
or subtract, and one floating point multiply or divide. There is no branch delay slot. Now unroll
the loop four times, and schedule it for this VLIW to take as few stall cycles as possible. How
many cycles do the four iterations take to complete? Use the following table template to show your
work.
Solution:
Grading scheme: 0.5 point for each correct row in the above table. 1 point if the scheduled code is correct
and takes 14 cycles.
Part E [6 points]
Provide the steady-state code for a software pipelined version of the loop given in this question.
Your code should give the minimum number of stalls using the minimum number of static
instructions. Assume the loop will have at least four iterations. You do not have to show the start-
up or finish-up code (i.e., prolog or epilog).
Solution:
Grading scheme:
3 points for correct offsets of instruction (1), (3a) and (6).
0.5 point for (1), 0.5 for (2), 0.25 for (3a), 0.25 for (3b), 0.5 for (6).
1 point if instructions (4), (5), (6), (7), and (8) are given in an order with no stall.
Alternate solution 1:
Grading scheme:
1 point if instructions (5), (6), (7), (8), and (9) are given in an order with no stall.
Alternate solution 2:
Grading scheme:
3 points for correct offsets of instruction (1), (4) and (7).
1.5 for x+2 instructions, 0.5 for x+1 and x+3 instructions.
1 point if instructions (5), (6), (7), (8), and (9) are given in an order with no stall.
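The idea behind software pipelining, where each trip of the steady-state loop mixes work from different original iterations, can be sketched in Python. The three-stage split used here (load, multiply-add, store) and the name daxpy_software_pipelined are assumptions chosen for illustration; they are not necessarily the stage assignment or offsets used in the graded solutions above. The prolog and epilog are included only so that the sketch runs end to end.

```python
def daxpy_software_pipelined(a, x, y):
    """DAXPY where each steady-state trip stores element i, computes
    element i + 1, and loads element i + 2."""
    n = len(x)
    if n != len(y):
        raise ValueError("x and y must have the same length")
    if n < 3:
        raise ValueError("this sketch needs at least three iterations")
    out = list(y)

    # Prolog: fill the pipeline.
    lx, ly = x[0], out[0]      # load element 0
    r = a * lx + ly            # compute element 0
    lx, ly = x[1], out[1]      # load element 1

    # Steady state: one store, one compute, one load per trip, each
    # belonging to a different original iteration.
    for i in range(n - 2):
        out[i] = r                      # store result for element i
        r = a * lx + ly                 # compute result for element i + 1
        lx, ly = x[i + 2], out[i + 2]   # load operands for element i + 2

    # Epilog: drain the pipeline.
    out[n - 2] = r
    out[n - 1] = a * lx + ly
    return out
```

Within one steady-state trip no operation consumes a value produced in that same trip, which is what lets a scheduler place the operations without waiting on one another.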
NOTE: ONLY GRADUATE STUDENTS SHOULD SOLVE THE NEXT TWO
PROBLEMS.
The above translates to the MIPS fragment below. R5 and R6 store variables i and c, respectively.
Suppose the segments “Code I” (if part), “Code II” (else part), and “Code III” (common part) contain
10, 100, and 10 assembly instructions respectively. You did a profile run of this program and found
that on average, Branch1 is taken once in 100 iterations of the “for loop”.
Your boss suggests that you perform one of the following two transformations to speed up the
above code: (1) Loop unrolling with an unrolling factor of 2. (2) Trace scheduling.
Which one of these would be more effective and why? Show the code with the more effective
transformation applied. If you use trace scheduling, then include any repair code and branches into
and out of it. Assume that only the values of c and i may need repair. Assume that registers R10
and higher are free for your use.
Solution:
You should perform trace scheduling. Loop unrolling would increase the code size significantly
because of the large else part. Further, because of the if-else statement, loop unrolling will
not give the compiler any longer straight-line code fragment to schedule. On the
other hand, trace scheduling can combine the “if” and “join” parts of the code
into a longer fragment of straight-line code. The large else part is moved out into repair code.
The above trace can be lengthened further by duplicating the loop index manipulation in the trace
and repair parts.
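The control structure of the trace and its repair code can be sketched in Python. Everything concrete in this sketch is a hypothetical stand-in: code_i, code_ii, and code_iii are placeholder bodies for the three segments, and branch1_taken is an assumed predicate that fires once in 100 iterations to match the profile. The sketch only shows that saving c, running Code I and Code III as one block, and repairing on the rare path gives the same result as the original if-else.

```python
MOD = 1_000_003  # keeps the placeholder arithmetic small


def code_i(c, i):      # hypothetical stand-in for "Code I" (if part)
    return (c + i) % MOD


def code_ii(c, i):     # hypothetical stand-in for "Code II" (else part)
    return (c * 7 + 2 * i) % MOD


def code_iii(c, i):    # hypothetical stand-in for "Code III" (common part)
    return (c * 3 + 1) % MOD


def branch1_taken(i):  # assumed profile: taken once in 100 iterations
    return i % 100 == 99


def run_original(n, c=0):
    for i in range(n):
        if branch1_taken(i):
            c = code_ii(c, i)
        else:
            c = code_i(c, i)
        c = code_iii(c, i)
    return c


def run_trace_scheduled(n, c=0):
    for i in range(n):
        saved_c = c                        # save the old value of c
        c = code_iii(code_i(c, i), i)      # trace: Code I + Code III together
        if branch1_taken(i):
            # Repair code (rare path): restore c, then run Code II + Code III.
            c = saved_c
            c = code_iii(code_ii(c, i), i)
    return c
```

Both functions return the same value of c for any iteration count; the trace version simply keeps the common path free of the large else part.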
Grading:
3 points for the trace (1 point for saving the old value of c, 1 point for combining code I and code III in
the trace, 1 point for the correct branch to repair code).
3 points for the repair code (1 point for restoring c, 1 point for combining code II and III, 1 point for the
jump back to the trace).
2 points for a fully correct answer.
If you answered Loop Unrolling, then you will be graded out of 4 for the unrolled code. 1 point for the
unrolled loop body, 1 point for correct register usage, 1 point for correct branch index manipulation, and 1
point for a fully correct unrolled loop.
Problem 3 [8 points]
The example on page H-30 of the textbook uses a speculative load instruction to move a load above
its guarding branch instruction. Read appendix H in the text for this problem and apply the
concepts to the following code:
Part A [4 points]
Write the above code using a speculative load (sL.D) and a speculation check instruction
(SPECCK) to preserve exception behavior. Where should the load instruction move to best hide
its potentially long latency?
Solution:
The speculative load instruction defers the hardware response to a memory access fault if one
occurs. In combination with the speculation check instruction this allows the load to be moved
above the branch. Because the load may have long latency, it should be moved as early in the
program as possible, in this case to the position of the first instruction in the basic block. If the
speculation check finds no deferred exceptions, computation can proceed.
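The deferred-exception behavior described above can be modeled in a few lines of Python. This is a software analogy only: speculative_load and speculation_check are hypothetical helpers standing in for sL.D and SPECCK, memory is modeled as a dictionary, and the null-pointer guard is an assumed example of a guarding branch rather than the code from the problem statement.

```python
def speculative_load(memory, addr):
    """Model of sL.D: never raises; returns (value, poisoned)."""
    if addr in memory:
        return memory[addr], False
    return 0.0, True  # fault is deferred, the value is garbage


def speculation_check(poisoned):
    """Model of SPECCK: deliver a deferred fault, if there is one."""
    if poisoned:
        raise MemoryError("deferred memory access fault")


def guarded_use(memory, ptr, f4):
    # The load is hoisted to the top of the block, above its guarding branch.
    f2, poisoned = speculative_load(memory, ptr)
    if ptr is None:
        # Guard fails: the speculated value is never used and any deferred
        # fault is never delivered, so exception behavior is preserved.
        return f4
    speculation_check(poisoned)  # guard passed: the load was really needed
    return f2 + f4
```

A fault on the speculated load is reported only on the path where the original program would have executed the load.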
Grading:
Part B [4 points]
Assume a speculation check instruction that branches to the recovery code. Assume that the
speculative load instruction defers both terminating and non-terminating exceptions. Write the
above code speculating on both the load and the dependent add. Use a speculative load, a non-
speculative add, a check instruction, and the block of recovery code. How should the speculated
load and the add be scheduled with respect to each other?
Solution:
This problem may have several different solutions. Only one is provided here.
With a speculation check instruction that can branch to the recovery code, instructions dependent
on the load can also be speculated. Now, if the load fails because of a high-latency exception
(e.g., a page fault) rather than a fatal error (e.g., a memory protection access violation),
the speculated use instruction can take as an operand an incorrect value from the register that is
the target of the delayed load. The speculation check instruction can distinguish these types of
exceptions, terminating the program in the event of a protection violation and branching to
recovery code for the case of a page fault, which will yield correct load behavior given sufficient
time.
back: ...
... ; etc.
null: ...
Note: Although repair code would be needed in the “null” section of the code for correct behavior
(the values of F2 and F4 need to be restored), the question explicitly does not ask for it.
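The interaction between the speculated load, the dependent add, the check, and the recovery code can be modeled in Python. All names here are hypothetical: memory is a dictionary, resident is an assumed set of paged-in addresses used to distinguish a page fault from a protection violation, and the null guard is an assumed example. The sketch sidesteps register restoration on the null path by returning the original f4 value.

```python
def speculative_load(memory, resident, addr):
    """Model of a speculative load: returns (value, deferred_exception)."""
    if addr not in memory:
        return 0.0, "protection"   # fatal once checked
    if addr not in resident:
        return 0.0, "page_fault"   # recoverable, value is garbage for now
    return memory[addr], None


def load_add_with_recovery(memory, resident, ptr, f4):
    f2, deferred = speculative_load(memory, resident, ptr)  # speculated load
    total = f2 + f4              # dependent add, may consume a garbage f2
    if ptr is None:
        return f4                # null path: speculated results are discarded
    # Speculation check.
    if deferred == "protection":
        raise MemoryError("memory protection violation")
    if deferred == "page_fault":
        # Recovery code: service the fault, then redo the load and the add.
        resident.add(ptr)
        f2 = memory[ptr]
        total = f2 + f4
    return total                 # corresponds to resuming after the check
```

The add is placed after the load it depends on, and the recovery block repeats both because the first add may have used an incorrect operand.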
Grading: