Data Dependences and Hazards

The document discusses the types of data dependences: true data dependences, name dependences (antidependences and output dependences), and control dependences. It describes how these dependences can lead to RAW, WAR, and WAW hazards, and it covers compiler techniques for exploiting instruction-level parallelism, namely pipeline scheduling and loop unrolling, to reduce stalls and improve performance. Loop unrolling replicates the loop body to expose more parallelism, while scheduling rearranges instructions to reduce stalls from dependences.


Types of Dependences:
1) Data Dependence
2) Name Dependence
3) Control Dependence
• An instruction j is data dependent on instruction i if either of the following holds:
  • instruction i produces a result that may be used by instruction j, or
  • instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i, so the dependence can hold through a chain of instructions (see the sketch after this list).
• If two instructions are data dependent, they cannot
execute simultaneously or be completely overlapped.
• The dependence implies that there would be a chain
of one or more data hazards between the two
instructions.
• Dependences are a property of programs. Whether a
given dependence results in an actual hazard being
detected and whether that hazard actually causes a
stall are properties of the pipeline organization.
• This difference is critical to understanding how
instruction-level parallelism can be exploited.
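
A minimal sketch of such a dependence chain, using the MIPS floating-point loop body that appears later in these slides:

      L.D    F0,0(R1)    ; i: loads the array element into F0
      ADD.D  F4,F0,F2    ; j: reads F0 produced by the L.D (direct dependence)
      S.D    F4,0(R1)    ; k: reads F4 produced by the ADD.D, so it is data
                         ;    dependent on the L.D through the chain i -> j -> k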
• A data dependence conveys three things:
  1) the possibility of a hazard,
  2) the order in which results must be calculated, and
  3) an upper bound on how much parallelism can possibly be exploited.
• A dependence can be overcome in two
different ways:
1) maintaining the dependence but avoiding a
hazard, and
2) eliminating a dependence by transforming the
code.
• Scheduling the code is the primary method
used to avoid a hazard without altering a
dependence, and such scheduling can be done
both by the compiler and by the hardware.
• A data value may flow between instructions either through
registers or through memory locations.
• When the data flow occurs in a register, detecting the
dependence is straightforward since the register names are
fixed in the instructions, although it gets more complicated
when branches intervene and correctness concerns force a
compiler or hardware to be conservative.
• Dependences that flow through memory locations are more
difficult to detect, since two addresses may refer to the
same location but look different:
For example,
100(R4) and 20(R6) may be identical memory addresses.
• In addition, the effective address of a load or store may
change from one execution of the instruction to another (so
that 20(R4) and 20(R4) may be different), further
complicating the detection of a dependence.
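
A minimal sketch of the ambiguity, with register values assumed for illustration:

      S.D    F4,100(R4)  ; stores to address (R4) + 100
      L.D    F6,20(R6)   ; loads from (R6) + 20, which is the same location
                         ; if, say, R4 = 1000 and R6 = 1080; neither the
                         ; compiler nor the hardware can tell from the
                         ; instruction encodings alone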
Name Dependences
• A name dependence occurs when two instructions use
the same register or memory location, called a name,
but there is no flow of data between the instructions
associated with that name.
• An antidependence between instruction i and
instruction j occurs when instruction j writes a register
or memory location that instruction i reads.
• The original ordering must be preserved to ensure that
i reads the correct value.
• An output dependence occurs when instruction i and
instruction j write the same register or memory
location. The ordering between the instructions must
be preserved to ensure that the value finally written
corresponds to instruction j .
• Both antidependences and output dependences are name
dependences, as opposed to true data dependences, since
there is no value being transmitted between the instructions.
• Since a name dependence is not a true dependence,
instructions involved in a name dependence can execute
simultaneously or be reordered, if the name (register number
or memory location) used in the instructions is changed so
the instructions do not conflict.
• This renaming can be more easily done for
register operands, where it is called register
renaming. Register renaming can be done either
statically by a compiler or dynamically by the
hardware.
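
A minimal sketch of register renaming, with register names chosen for illustration:

      ; Before renaming: the second L.D has an output dependence (F0) with
      ; the first L.D and an antidependence (F0) with the first ADD.D
      L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      L.D    F0,-8(R1)
      ADD.D  F8,F0,F2

      ; After renaming the second use of F0 to F6, the two load/add pairs
      ; share no names and can be reordered or overlapped
      L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2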
Data Hazards
• A hazard is created whenever there is a
dependence between instructions, and they
are close enough that the overlap during
execution would change the order of access to
the operand involved in the dependence.
• Three types:
1) RAW (read after write)
2) WAR (write after read)
3) WAW (write after write)
• RAW (read after write)—j tries to read a source
before i writes it, so j incorrectly gets the old
value. This hazard is the most common type and
corresponds to a true data dependence.
• Program order must be preserved to ensure that j
receives the value from i.
• WAR (write after read)—j tries to write a destination
before it is read by i, so i incorrectly gets the new
value. This hazard corresponds to an antidependence.
• WAW (write after write)—j tries to write an
operand before it is written by i. The writes end up
being performed in the wrong order, leaving the
value written by i rather than the value written by j
in the destination.
• This hazard corresponds to an output
dependence. WAW hazards are present only in
pipelines that write in more than one pipe stage
or allow an instruction to proceed even when a
previous instruction is stalled.
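
A minimal sketch of a WAW hazard, with instructions chosen for illustration:

      DIV.D  F4,F0,F2    ; i: long-latency operation writes F4 late
      ADD.D  F4,F6,F8    ; j: also writes F4; in a pipeline that lets the
                         ;    ADD.D complete first, the later-finishing DIV.D
                         ;    would overwrite it, leaving i's value instead
                         ;    of j's in F4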
• Consider the following assembly language program:
  I1: Move R3, R7     / R3 ← (R7) /
  I2: Load R8, (R3)   / R8 ← Memory[(R3)] /
  I3: Add R3, R3, 4   / R3 ← (R3) + 4 /
  I4: Load R9, (R3)   / R9 ← Memory[(R3)] /
  I5: BLE R8, R9, L3  / Branch to L3 if (R8) ≤ (R9) /
• This program includes WAW, RAW, and WAR
dependences. Show these.
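One way to enumerate them, tracing each register's writes and reads:
• RAW (true data dependences): I1 → I2 and I1 → I3 on R3, I3 → I4 on R3, I2 → I5 on R8, and I4 → I5 on R9.
• WAR (antidependence): I2 → I3 on R3 (I2 reads R3 before I3 overwrites it).
• WAW (output dependence): I1 and I3 both write R3.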
• Identify the write-read, write-write, and
read-write dependences in the following
instruction sequence:
  I1: R1 = 100
  I2: R1 = R2 + R4
  I3: R2 = R4 – 25
  I4: R4 = R1 + R3
  I5: R1 = R1 + 30
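One way to enumerate them:
• Write-read (RAW): I2 → I4 and I2 → I5 on R1.
• Write-write (WAW): I1 and I2 on R1, and I2 and I5 on R1 (and, transitively, I1 and I5).
• Read-write (WAR): I2 → I3 on R2, I2 → I4 and I3 → I4 on R4, and I4 → I5 on R1.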
Compiler Techniques to Expose ILP:
Pipeline Scheduling and Loop Unrolling
• To keep a pipeline full, parallelism among
instructions must be exploited by finding
sequences of unrelated instructions that can
be overlapped in the pipeline.
• To avoid a pipeline stall, a dependent
instruction must be separated from the source
instruction by a distance in clock cycles equal
to the pipeline latency of that source
instruction.
Ex: for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;
Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1
Load double                     Store double                0

Note: The last column is the number of intervening clock cycles needed to
avoid a stall
The latency of a floating-point load to a store is 0, since the result of the load
can be bypassed without stalling the store.
We will continue to assume an integer load latency of 1 and an integer ALU
operation latency of 0.
MIPS assembly language
Loop: L.D    F0,0(R1)     ; F0 = array element
      ADD.D  F4,F0,F2     ; add scalar in F2
      S.D    F4,0(R1)     ; store result
      DADDUI R1,R1,#-8    ; decrement pointer; 8 bytes (per DW)
      BNE    R1,R2,Loop   ; branch if R1 != R2
• Without any scheduling, the loop will execute as
follows, taking 9 cycles:
                              Clock cycle issued
Loop: L.D    F0,0(R1)                1
      stall                          2
      ADD.D  F4,F0,F2                3
      stall                          4
      stall                          5
      S.D    F4,0(R1)                6
      DADDUI R1,R1,#-8               7
      stall                          8
      BNE    R1,R2,Loop              9
• We can schedule the loop to obtain only two stalls
and reduce the time to 7 cycles:
Loop: L.D    F0,0(R1)
      DADDUI R1,R1,#-8
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,8(R1)
      BNE    R1,R2,Loop
The two stalls after the ADD.D remain because the S.D must
wait for its result; note that the store's offset becomes
8(R1) since the DADDUI now executes before it.
• In the previous example, we complete one loop
iteration and store back one array element every 7
clock cycles, but the actual work of operating on the
array element takes just 3 (the load, add, and store) of
those 7 clock cycles.
• The remaining 4 clock cycles consist of loop overhead
(the DADDUI and BNE) and two stalls.
• To eliminate these 4 clock cycles we need to get more
operations relative to the number of overhead
instructions.
• A simple scheme for increasing the number of
instructions relative to the branch and overhead
instructions is loop unrolling. Unrolling simply
replicates the loop body multiple times, adjusting the
loop termination code.
Ex: Show our loop unrolled so that there are
four copies of the loop body, assuming R1 - R2
(that is, the size of the array) is initially a
multiple of 32, which means that the number
of loop iterations is a multiple of 4. Eliminate
any obviously redundant computations and do
not reuse any of the registers.
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)      ; drop DADDUI & BNE
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,-8(R1)     ; drop DADDUI & BNE
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,-16(R1)   ; drop DADDUI & BNE
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE    R1,R2,Loop
Note that R2 must now be set so that 32(R2) is the starting address
of the last four elements.
Total = 27 clock cycles, or 6.75 per element: 14 instruction issue
cycles plus 13 stalls (1 after each L.D, 2 after each ADD.D, and 1
after the DADDUI).
Scheduled unrolled loop
Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D    F12,16(R1)
      S.D    F16,8(R1)
      BNE    R1,R2,Loop
• The execution time of the unrolled loop has dropped to a total of 14 clock
cycles, or 3.5 clock cycles per element, compared with 9 cycles per element
before any unrolling or scheduling and 7 cycles when scheduled but not
unrolled. (The last two stores use offsets 16(R1) and 8(R1) because the
DADDUI, scheduled ahead of them, has already decremented R1 by 32.)

Summary of Loop Unrolling and Scheduling
• Determine that unrolling the loop would be useful by finding that
the loop iterations were independent, except for the loop
maintenance code.
• Use different registers to avoid unnecessary constraints that would
be forced by using the same registers for different computations.
• Eliminate the extra test and branch instructions and adjust the loop
termination and iteration code.
• Determine that the loads and stores in the unrolled loop can be
interchanged by observing that the loads and stores from different
iterations are independent.
• This transformation requires analyzing the memory addresses and
finding that they do not refer to the same address.
• Schedule the code, preserving any dependences needed to yield
the same result as the original code.
There are three different types of limits to the
gains that can be achieved by loop unrolling:
1) a decrease in the amount of overhead amortized
with each additional unroll,
2) code size limitations, and
3) compiler limitations.
• When we unrolled the loop four times, only 2 of the
14 clock cycles were loop overhead: the DADDUI,
which maintains the index value, and the BNE,
which terminates the loop. Amortized over four
iterations, that is 1/2 cycle of overhead per
original iteration.
• If the loop is unrolled eight times, the overhead is
reduced from 1/2 cycle per original iteration to 1/4
(still 2 overhead cycles, now amortized over 8
iterations), so each additional unroll recovers less.
2) A second limit to unrolling is the growth in code
size that results. For larger loops, the code size
growth may be a concern, particularly if it causes
an increase in the instruction cache miss rate.
3) Another factor often more important than code
size is the potential shortfall in registers that is
created by aggressive unrolling and scheduling.
• This secondary effect that results from instruction
scheduling in large code segments is called
register pressure. It arises because scheduling
code to increase ILP causes the number of live
values to increase.
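
As a rough sketch (register assignments illustrative), unrolling this loop
eight times and scheduling all the loads ahead of all the adds keeps eight
load results live at once, and the eight sums each need a register of their
own until their stores complete:

Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      L.D    F18,-32(R1)
      L.D    F22,-40(R1)
      L.D    F26,-48(R1)
      L.D    F30,-56(R1)   ; eight load results now live simultaneously
      ...                  ; eight ADD.D/S.D pairs follow

Scheduling for more overlap therefore directly increases the number of
simultaneously live values, which is the register pressure described above.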
