Data Dependences and Hazards

The document discusses the types of data dependences: true data dependences, name dependences (antidependences and output dependences), and control dependences. It describes how these dependences can lead to RAW, WAR, and WAW hazards, and it covers compiler techniques for exploiting instruction-level parallelism, namely pipeline scheduling and loop unrolling, to reduce stalls and improve performance. Loop unrolling replicates the loop body to expose more parallelism, while scheduling rearranges instructions to reduce stalls from dependences.


Types of Dependences:
1) Data Dependence
2) Name Dependence
3) Control Dependence
• An instruction j is data dependent on instruction i if either of the following holds:
  • instruction i produces a result that may be used by instruction j, or
  • instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i, so the dependence can hold through a chain of instructions (see the sketch after this list).
• If two instructions are data dependent, they cannot
execute simultaneously or be completely overlapped.
• The dependence implies that there would be a chain
of one or more data hazards between the two
instructions.
• Dependences are a property of programs. Whether a
given dependence results in an actual hazard being
detected and whether that hazard actually causes a
stall are properties of the pipeline organization.
• This difference is critical to understanding how
instruction-level parallelism can be exploited.
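
A minimal sketch of such a dependence chain, using the MIPS floating-point loop body that appears later in these slides:

      L.D    F0,0(R1)    ; i: loads the array element into F0
      ADD.D  F4,F0,F2    ; j: reads F0 produced by the L.D (direct dependence)
      S.D    F4,0(R1)    ; k: reads F4 produced by the ADD.D, so it is data
                         ;    dependent on the L.D through the chain i -> j -> k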
• A data dependence conveys three things:
  1) the possibility of a hazard,
  2) the order in which results must be calculated, and
  3) an upper bound on how much parallelism can possibly be exploited.
• A dependence can be overcome in two
different ways:
1) maintaining the dependence but avoiding a
hazard, and
2) eliminating a dependence by transforming the
code.
• Scheduling the code is the primary method
used to avoid a hazard without altering a
dependence, and such scheduling can be done
both by the compiler and by the hardware.
• A data value may flow between instructions either through
registers or through memory locations.
• When the data flow occurs in a register, detecting the
dependence is straightforward since the register names are
fixed in the instructions, although it gets more complicated
when branches intervene and correctness concerns force a
compiler or hardware to be conservative.
• Dependences that flow through memory locations are more
difficult to detect, since two addresses may refer to the
same location but look different:
For example,
100(R4) and 20(R6) may be identical memory addresses.
• In addition, the effective address of a load or store may
change from one execution of the instruction to another (so
that 20(R4) and 20(R4) may be different), further
complicating the detection of a dependence.
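
A minimal sketch of the ambiguity, with register values assumed for illustration:

      S.D    F4,100(R4)  ; stores to address (R4) + 100
      L.D    F6,20(R6)   ; loads from (R6) + 20, which is the same location
                         ; if, say, R4 = 1000 and R6 = 1080; neither the
                         ; compiler nor the hardware can tell from the
                         ; instruction encodings alone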
Name Dependences
• A name dependence occurs when two instructions use
the same register or memory location, called a name,
but there is no flow of data between the instructions
associated with that name.
• An antidependence between instruction i and
instruction j occurs when instruction j writes a register
or memory location that instruction i reads.
• The original ordering must be preserved to ensure that
i reads the correct value.
• An output dependence occurs when instruction i and
instruction j write the same register or memory
location. The ordering between the instructions must
be preserved to ensure that the value finally written
corresponds to instruction j .
• Both antidependences and output dependences are name
dependences, as opposed to true data dependences, since
there is no value being transmitted between the instructions.
• Since a name dependence is not a true dependence,
instructions involved in a name dependence can execute
simultaneously or be reordered, if the name (register number
or memory location) used in the instructions is changed so
the instructions do not conflict.
• This renaming can be more easily done for
register operands, where it is called register
renaming. Register renaming can be done either
statically by a compiler or dynamically by the
hardware.
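
A minimal sketch of register renaming, with register names chosen for illustration:

      ; Before renaming: the second L.D has an output dependence (F0) with
      ; the first L.D and an antidependence (F0) with the first ADD.D
      L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      L.D    F0,-8(R1)
      ADD.D  F8,F0,F2

      ; After renaming the second use of F0 to F6, the two load/add pairs
      ; share no names and can be reordered or overlapped
      L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2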
Data Hazards
• A hazard is created whenever there is a
dependence between instructions, and they
are close enough that the overlap during
execution would change the order of access to
the operand involved in the dependence.
• Three types:
1) RAW (read after write)
2) WAR (write after read)
3) WAW (write after write)
• RAW (read after write)—j tries to read a source
before i writes it, so j incorrectly gets the old
value. This hazard is the most common type and
corresponds to a true data dependence.
• Program order must be preserved to ensure that j
receives the value from i.
• WAR (write after read)—j tries to write a destination
before it is read by i, so i incorrectly gets the new
value. This hazard corresponds to an antidependence.
• WAW (write after write)—j tries to write an
operand before it is written by i. The writes end up
being performed in the wrong order, leaving the
value written by i rather than the value written by j
in the destination.
• This hazard corresponds to an output
dependence. WAW hazards are present only in
pipelines that write in more than one pipe stage
or allow an instruction to proceed even when a
previous instruction is stalled.
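
A minimal sketch of a WAW hazard, with instructions chosen for illustration:

      DIV.D  F4,F0,F2    ; i: long-latency operation writes F4 late
      ADD.D  F4,F6,F8    ; j: also writes F4; in a pipeline that lets the
                         ;    ADD.D complete first, the later-finishing DIV.D
                         ;    would overwrite it, leaving i's value instead
                         ;    of j's in F4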
• Consider the following assembly language program:
  I1: Move R3, R7     / R3 ← (R7) /
  I2: Load R8, (R3)   / R8 ← Memory[(R3)] /
  I3: Add R3, R3, 4   / R3 ← (R3) + 4 /
  I4: Load R9, (R3)   / R9 ← Memory[(R3)] /
  I5: BLE R8, R9, L3  / Branch to L3 if (R8) ≤ (R9) /
• This program includes WAW, RAW, and WAR
dependences. Show these.
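One way to enumerate them, tracing each register's writes and reads:
• RAW (true data dependences): I1 → I2 and I1 → I3 on R3, I3 → I4 on R3, I2 → I5 on R8, and I4 → I5 on R9.
• WAR (antidependence): I2 → I3 on R3 (I2 reads R3 before I3 overwrites it).
• WAW (output dependence): I1 and I3 both write R3.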
• Identify the write-read, write-write, and
read-write dependences in the following
instruction sequence:
  I1: R1 = 100
  I2: R1 = R2 + R4
  I3: R2 = R4 – 25
  I4: R4 = R1 + R3
  I5: R1 = R1 + 30
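One way to enumerate them:
• Write-read (RAW): I2 → I4 and I2 → I5 on R1.
• Write-write (WAW): I1 and I2 on R1, and I2 and I5 on R1 (and, transitively, I1 and I5).
• Read-write (WAR): I2 → I3 on R2, I2 → I4 and I3 → I4 on R4, and I4 → I5 on R1.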
Compiler Techniques to Expose ILP:
Pipeline Scheduling and Loop Unrolling
• To keep a pipeline full, parallelism among
instructions must be exploited by finding
sequences of unrelated instructions that can
be overlapped in the pipeline.
• To avoid a pipeline stall, a dependent
instruction must be separated from the source
instruction by a distance in clock cycles equal
to the pipeline latency of that source
instruction.
Ex: for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;
Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1
Load double                     Store double                0

Note: The last column is the number of intervening clock cycles needed to
avoid a stall
The latency of a floating-point load to a store is 0, since the result of the load
can be bypassed without stalling the store.
We will continue to assume an integer load latency of 1 and an integer ALU
operation latency of 0.
MIPS assembly language
Loop: L.D    F0,0(R1)     ; F0 = array element
      ADD.D  F4,F0,F2     ; add scalar in F2
      S.D    F4,0(R1)     ; store result
      DADDUI R1,R1,#-8    ; decrement pointer; 8 bytes (per DW)
      BNE    R1,R2,Loop   ; branch if R1 != R2
• Without any scheduling, the loop will execute as
follows, taking 9 cycles:
                              Clock cycle issued
Loop: L.D    F0,0(R1)                1
      stall                          2
      ADD.D  F4,F0,F2                3
      stall                          4
      stall                          5
      S.D    F4,0(R1)                6
      DADDUI R1,R1,#-8               7
      stall                          8
      BNE    R1,R2,Loop              9
• We can schedule the loop to obtain only two stalls
and reduce the time to 7 cycles:
Loop: L.D    F0,0(R1)
      DADDUI R1,R1,#-8
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,8(R1)
      BNE    R1,R2,Loop
The two stalls after the ADD.D remain because the S.D must
wait for its result; note that the store's offset becomes
8(R1) since the DADDUI now executes before it.
• In the previous example, we complete one loop
iteration and store back one array element every 7
clock cycles, but the actual work of operating on the
array element takes just 3 (the load, add, and store) of
those 7 clock cycles.
• The remaining 4 clock cycles consist of loop overhead
(the DADDUI and BNE) and two stalls.
• To eliminate these 4 clock cycles we need to get more
operations relative to the number of overhead
instructions.
• A simple scheme for increasing the number of
instructions relative to the branch and overhead
instructions is loop unrolling. Unrolling simply
replicates the loop body multiple times, adjusting the
loop termination code.
Ex: Show our loop unrolled so that there are
four copies of the loop body, assuming R1 - R2
(that is, the size of the array) is initially a
multiple of 32, which means that the number
of loop iterations is a multiple of 4. Eliminate
any obviously redundant computations and do
not reuse any of the registers.
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)      ; drop DADDUI & BNE
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,-8(R1)     ; drop DADDUI & BNE
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,-16(R1)   ; drop DADDUI & BNE
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE    R1,R2,Loop
Note that R2 must now be set so that 32(R2) is the starting address
of the last four elements.
Total = 27 clock cycles, or 6.75 per element: 14 instruction issue
cycles plus 13 stalls (1 after each L.D, 2 after each ADD.D, and 1
after the DADDUI).
Scheduled unrolled loop
Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D    F12,16(R1)
      S.D    F16,8(R1)
      BNE    R1,R2,Loop
• The execution time of the unrolled loop has dropped to a total of 14 clock
cycles, or 3.5 clock cycles per element, compared with 9 cycles per element
before any unrolling or scheduling and 7 cycles when scheduled but not
unrolled. (The last two stores use offsets 16(R1) and 8(R1) because the
DADDUI, scheduled ahead of them, has already decremented R1 by 32.)

Summary of Loop Unrolling and Scheduling
• Determine that unrolling the loop would be useful by finding that
the loop iterations were independent, except for the loop
maintenance code.
• Use different registers to avoid unnecessary constraints that would
be forced by using the same registers for different computations.
• Eliminate the extra test and branch instructions and adjust the loop
termination and iteration code.
• Determine that the loads and stores in the unrolled loop can be
interchanged by observing that the loads and stores from different
iterations are independent.
• This transformation requires analyzing the memory addresses and
finding that they do not refer to the same address.
• Schedule the code, preserving any dependences needed to yield
the same result as the original code.
There are three different types of limits to the
gains that can be achieved by loop unrolling:
1) a decrease in the amount of overhead amortized
with each additional unroll,
2) code size limitations, and
3) compiler limitations.
• When we unrolled the loop four times, only 2 of the
14 clock cycles were loop overhead: the DADDUI,
which maintains the index value, and the BNE,
which terminates the loop. Amortized over four
iterations, that is 1/2 cycle of overhead per
original iteration.
• If the loop is unrolled eight times, the overhead is
reduced from 1/2 cycle per original iteration to 1/4
(still 2 overhead cycles, now amortized over 8
iterations), so each additional unroll recovers less.
2) A second limit to unrolling is the growth in code
size that results. For larger loops, the code size
growth may be a concern, particularly if it causes
an increase in the instruction cache miss rate.
3) Another factor often more important than code
size is the potential shortfall in registers that is
created by aggressive unrolling and scheduling.
• This secondary effect that results from instruction
scheduling in large code segments is called
register pressure. It arises because scheduling
code to increase ILP causes the number of live
values to increase.
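
As a rough sketch (register assignments illustrative), unrolling this loop
eight times and scheduling all the loads ahead of all the adds keeps eight
load results live at once, and the eight sums each need a register of their
own until their stores complete:

Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      L.D    F18,-32(R1)
      L.D    F22,-40(R1)
      L.D    F26,-48(R1)
      L.D    F30,-56(R1)   ; eight load results now live simultaneously
      ...                  ; eight ADD.D/S.D pairs follow

Scheduling for more overlap therefore directly increases the number of
simultaneously live values, which is the register pressure described above.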
