
CS3350B Computer Architecture
Winter 2015

Lecture 6.3: Instruction-Level Parallelism:
Advanced Techniques

Marc Moreno Maza

www.csd.uwo.ca/Courses/CS3350b

[Adapted from lectures on Computer Organization and Design,
Patterson & Hennessy, 5th edition, 2011]

0
Greater Instruction-Level Parallelism
 Deeper pipeline (more stages: 5 => 10 => 15)
 Less work per stage => shorter clock cycle

 Multiple issue ("superscalar")
 Replicate pipeline stages => multiple pipelines
- e.g., have two ALUs or a register file with 4 read ports and 2 write ports
- have logic to issue several instructions concurrently
 Execute more than one instruction per clock cycle, producing an
effective CPI < 1, so use instructions per cycle (IPC) instead
 e.g., 4 GHz 4-way multiple issue
- 16 BIPS, peak CPI = 0.25, peak IPC = 4
 If a datapath has a 5-stage pipeline, how many instructions are active
in the pipeline at any given time? Up to 5 with single issue (one per
stage), and up to 4 x 5 = 20 with 4-way issue
 But dependencies reduce this in practice

1
Pipeline Depth and Issue Width
 Intel Processors over Time

Microprocessor    Year  Clock Rate  Pipeline Stages  Issue Width  Cores  Power
i486              1989    25 MHz          5               1         1      5W
Pentium           1993    66 MHz          5               2         1     10W
Pentium Pro       1997   200 MHz         10               3         1     29W
P4 Willamette     2001  2000 MHz         22               3         1     75W
P4 Prescott       2004  3600 MHz         31               3         1    103W
Core 2 Conroe     2006  2930 MHz         14               4         2     75W
Core 2 Yorkfield  2008  2930 MHz         16               4         4     95W
Core i7 Gulftown  2010  3460 MHz         16               4         6    130W
2
Multiple-Issue Processor Styles
 Static multiple-issue processors, aka VLIW (very long
instruction word)
 Decisions on which instructions to execute simultaneously are
made statically (at compile time, by the compiler)
 e.g., Intel Itanium and Itanium 2
- 128-bit "bundles" containing three instructions
- Five functional units (IntALU, Mmedia, Dmem, FPALU, Branch)
- Extensive support for speculation and predication
 Dynamic multiple-issue processors (aka superscalar)
 Decisions on which instructions to execute simultaneously (in
the range of 2 to 8) are made dynamically (at run time, by
the hardware)
- e.g., IBM POWER series, Pentium 4, MIPS R10K, AMD Barcelona
3
Multiple-Issue Datapath Responsibilities
 Must handle, with a combination of hardware and software
fixes, the fundamental limitations of:
 How many instructions to issue in one clock cycle – issue slots
 Storage (data) dependencies – aka data hazards
- Limitation is more severe in an SS/VLIW processor due to (usually) low
ILP
 Procedural dependencies – aka control hazards
- Ditto, but even more severe
- Use dynamic branch prediction to help resolve the ILP issue
 Resource conflicts – aka structural hazards
- An SS/VLIW processor has a much larger number of potential
resource conflicts
- Functional units may have to arbitrate for result buses and register-
file write ports
- Resource conflicts can be eliminated by duplicating the resource or
by pipelining the resource

4
Static Multiple Issue Machines (VLIW)
 Static multiple-issue processors (aka VLIW) use the
compiler (at compile-time) to statically decide which
instructions to issue and execute simultaneously
 Issue packet – the set of instructions that are bundled together
and issued in one clock cycle – think of it as one large instruction
with multiple operations
 The mix of instructions in the packet (bundle) is usually restricted
– a single “instruction” with several predefined fields
 The compiler does static branch prediction and code
scheduling to reduce (control) or eliminate (data) hazards

 VLIWs have
 Multiple functional units
 Multi-ported register files
 Wide program bus

5
An Example: A VLIW MIPS
 Consider a 2-issue MIPS with a 2-instruction bundle:

<---------------------- 64 bits ---------------------->
| ALU op (R format)       | Load or store (I format)  |
| or branch (I format)    |                           |

 Instructions are always fetched, decoded, and issued in
pairs
 If one instruction of the pair cannot be used, it is replaced with a nop

 Need 4 read ports and 2 write ports on the register file, and a separate
memory address adder
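
One way to picture the issue packet in C (an illustrative sketch only; the
type and field names are invented for this example):

#include <stdint.h>

/* One 64-bit issue packet for the 2-issue MIPS above: slot 0 carries an
   ALU or branch instruction, slot 1 a load or store (or a nop filler). */
typedef struct {
    uint32_t alu_or_branch;   /* R-format ALU op or I-format branch */
    uint32_t load_or_store;   /* I-format memory op, or nop if unused */
} vliw_bundle_t;

The fetch unit pulls one such 8-byte packet per cycle, which is why an
unusable slot must be filled with a nop rather than left out.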

6
Code Scheduling Example
 Consider the following loop code:
lp: lw   $t0,0($s1)    # $t0=array element
    addu $t0,$t0,$s2   # add scalar in $s2
    sw   $t0,0($s1)    # store result
    addi $s1,$s1,-4    # decrement pointer
    bne  $s1,$0,lp     # branch if $s1 != 0

 Equivalent C code:
/* increment each element (unsigned integer) in array A by n */
for (i=m; i>=0; --i)   /* m is the initial value of $s1 */
    A[i] += n;         /* n is the value in register $s2 */

 Must "schedule" the instructions to avoid pipeline stalls
 Instructions in one bundle must be independent
 Must separate load/use instructions from their loads by one cycle
 Notice that the first two instructions have a load/use
dependency; the next two and the last two have data dependencies
 Assume branches are perfectly predicted by the hardware

7
The Scheduled Code (Not Unrolled)
        ALU or branch           Data transfer       CC
lp:     nop                     lw  $t0,0($s1)       1
        addi $s1,$s1,-4         nop                  2
        addu $t0,$t0,$s2        nop                  3
        bne  $s1,$0,lp          sw  $t0,4($s1)       4

(Original loop, for reference:)
lp: lw   $t0,0($s1)    # $t0=array element
    addu $t0,$t0,$s2   # add scalar in $s2
    sw   $t0,0($s1)    # store result
    addi $s1,$s1,-4    # decrement pointer
    bne  $s1,$0,lp     # branch if $s1 != 0

 Note that the sw offset is 4, not 0, because the addi that decrements
$s1 by 4 was hoisted above the sw
 Four clock cycles to execute 5 instructions, for a
 CPI of 0.8 (versus the best case of 0.5)
 IPC of 1.25 (versus the best case of 2.0)
 nops don't count towards performance!

8
Loop Unrolling
 Loop unrolling – multiple copies of the loop body are
made and instructions from different iterations are
scheduled together as a way to increase ILP

 Apply loop unrolling (4 times for our example) and then
schedule the resulting code
 Eliminate unnecessary loop overhead instructions
 Schedule so as to avoid load-use hazards

 During unrolling the compiler applies register renaming to
eliminate all data dependencies that are not true data
dependencies
9
Loop Unrolling in C

/* original */
for (i=m; i>=0; --i)
    A[i] += n;

/* unrolled 4 times */
for (i=m; i>=0; i-=4) {
    A[i]   += n;
    A[i-1] += n;
    A[i-2] += n;
    A[i-3] += n;
}

Assume the size of A is 8, i.e., m = 7.

Execute not-unrolled code:
Iteration #   i   Instruction
    1         7   A[7] += n
    2         6   A[6] += n
    3         5   A[5] += n
    4         4   A[4] += n
    5         3   A[3] += n
    6         2   A[2] += n
    7         1   A[1] += n
    8         0   A[0] += n

Execute unrolled code:
Iteration #1, i=7: { A[7] += n; A[6] += n; A[5] += n; A[4] += n; }
Iteration #2, i=3: { A[3] += n; A[2] += n; A[1] += n; A[0] += n; }
10
Apply Loop Unrolling 4 Times

/* code in C */
for (i=m; i>=0; i-=4) {
    A[i]   += n;
    A[i-1] += n;
    A[i-2] += n;
    A[i-3] += n;
}

Unrolled MIPS code:
lp: lw   $t0,0($s1)    # $t0=array element
    lw   $t1,-4($s1)   # $t1=array element
    lw   $t2,-8($s1)   # $t2=array element
    lw   $t3,-12($s1)  # $t3=array element
    addu $t0,$t0,$s2   # add scalar in $s2
    addu $t1,$t1,$s2   # add scalar in $s2
    addu $t2,$t2,$s2   # add scalar in $s2
    addu $t3,$t3,$s2   # add scalar in $s2
    sw   $t0,0($s1)    # store result
    sw   $t1,-4($s1)   # store result
    sw   $t2,-8($s1)   # store result
    sw   $t3,-12($s1)  # store result
    addi $s1,$s1,-16   # decrement pointer
    bne  $s1,$0,lp     # branch if $s1 != 0

(Original, not unrolled, for reference:)
lp: lw   $t0,0($s1)    # $t0=array element
    addu $t0,$t0,$s2   # add scalar in $s2
    sw   $t0,0($s1)    # store result
    addi $s1,$s1,-4    # decrement pointer
    bne  $s1,$0,lp     # branch if $s1 != 0

• Why not reuse $t0, but use $t1, $t2, $t3?
• Why -4, -8, -12 and $s1 = $s1 - 16?
• How many times can a loop be unrolled?

11
The Scheduled Code (Unrolled)
        ALU or branch           Data transfer         CC
lp:     addi $s1,$s1,-16        lw  $t0,0($s1)         1
        nop                     lw  $t1,12($s1)  #-4   2
        addu $t0,$t0,$s2        lw  $t2,8($s1)   #-8   3
        addu $t1,$t1,$s2        lw  $t3,4($s1)   #-12  4
        addu $t2,$t2,$s2        sw  $t0,16($s1)  #0    5
        addu $t3,$t3,$s2        sw  $t1,12($s1)  #-4   6
        nop                     sw  $t2,8($s1)   #-8   7
        bne  $s1,$0,lp          sw  $t3,4($s1)   #-12  8

/* code in C */
for (i=m; i>=0; i-=4) {
    A[i]   += n;
    A[i-1] += n;
    A[i-2] += n;
    A[i-3] += n;
}

 Because the addi that subtracts 16 from $s1 issues in the first cycle,
the later memory offsets are shifted by +16 (the # comments show the
original offsets)
 Eight clock cycles to execute 14 instructions, for a
 CPI of 0.57 (versus the best case of 0.5)
 IPC of 1.75 (versus the best case of 2.0)

12
Summary of Compiler Support for VLIW Processors
 The compiler packs groups of independent instructions
into the bundle
 Done by code re-ordering (trace scheduling)

 The compiler uses loop unrolling to expose more ILP

 The compiler uses register renaming to solve name
dependencies and ensures no load-use hazards occur

 While superscalars use dynamic prediction, VLIWs
primarily depend on the compiler for branch prediction
 Loop unrolling reduces the number of conditional branches
 Predication eliminates if-then-else branch structures by replacing
them with predicated instructions (see the sketch below)

 The compiler predicts memory bank references to help
minimize memory bank conflicts
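
A minimal C illustration of the idea behind predication (the function
names are invented for this example; on a predicated ISA the compiler can
lower the selection form to predicated or conditional-move instructions
instead of a branch):

/* Branchy form: contains a conditional branch that must be predicted. */
int abs_branchy(int x) {
    if (x < 0)
        x = -x;
    return x;
}

/* Branchless form: both outcomes are expressed as a selection, which a
   compiler may turn into predicated instructions, removing the control
   hazard from the scheduled bundles. */
int abs_branchless(int x) {
    return (x < 0) ? -x : x;
}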
14
VLIW Advantages & Disadvantages
 Advantages
 Simpler hardware (potentially less power hungry)
 Potentially more scalable
- Allow more instructions per VLIW bundle and add more FUs
 Disadvantages
 Programmer/compiler complexity and longer compilation times
- Deep pipelines and long latencies can be confusing (making peak
performance elusive)
 Lock-step operation, i.e., on a hazard all future issues stall until the
hazard is resolved (hence the need for predication)
 Object (binary) code incompatibility
 Needs lots of program memory bandwidth
 Code bloat
- Nops are a waste of program memory space
- Loop unrolling to expose more ILP uses more program memory
space
15
Dynamic Multiple Issue Machines (SS)
 Dynamic multiple-issue processors (aka superscalar) use
hardware at run time to dynamically decide which
instructions to issue and execute simultaneously
 Instruction fetch and issue – fetch instructions, decode
them, and issue them to an FU to await execution
 Instruction execution – as soon as the source operands
and the FU are ready, the result can be calculated
 Instruction commit – when it is safe to, write back results
to the RegFile or D$ (i.e., change the machine state)
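
A toy C sketch of the in-order commit step. A reorder buffer (ROB) is the
standard structure for this; it is not named on this slide, so treat the
code as an illustrative assumption rather than the lecture's design:

#include <stdbool.h>

#define ROB_SIZE 16

typedef struct {
    bool done;    /* instruction has finished executing (possibly early) */
    int  dest;    /* destination register number */
    int  value;   /* result waiting to be written back */
} rob_entry_t;

static rob_entry_t rob[ROB_SIZE];
static int rob_head = 0;     /* index of the oldest in-flight instruction */
static int regfile[32];

/* Results may become ready out of order (done flags set in any order),
   but they are committed to the register file strictly in program order,
   stopping at the first instruction that has not finished. */
void commit_step(void) {
    while (rob[rob_head].done) {
        regfile[rob[rob_head].dest] = rob[rob_head].value;
        rob[rob_head].done = false;             /* retire and free entry */
        rob_head = (rob_head + 1) % ROB_SIZE;
    }
}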

16
Dynamic Multiple Issue Machines (SS)

[Figure omitted in this text version]

17
Dynamic Pipeline Scheduling
 Allow the CPU to execute instructions out of order to
avoid stalls
 But commit results to registers in order

 Example
lw   $t0, 20($s2)
addu $t1, $t0, $t2
subu $s4, $s4, $t3
slti $t5, $s4, 20
 Can start subu while addu is waiting for lw, since subu does not
depend on $t0

18
Why Do Dynamic Scheduling?
 Why not just let the compiler schedule code?
 Disadvantages of having the compiler schedule the code:

 Not all stalls are predictable
 e.g., cache misses

 Can't always schedule around branches
 Branch outcome is dynamically determined

 Different implementations of an ISA have different
latencies and hazards
19
Speculation
 "Guess" what to do with an instruction
 Start the operation as soon as possible
 Check whether the guess was right
- If so, complete the operation
- If not, roll back and do the right thing

 Common to static and dynamic multiple issue

 Examples
 Speculate on branch outcome (branch prediction)
- Roll back if the path taken is different
 Speculate on a load
- Roll back if the location is updated
20
Out-of-Order Intel
 Intel processors have used out-of-order execution and
speculation since the Pentium Pro

Microprocessor    Year  Clock Rate  Pipeline Stages  Issue Width  OOO/Speculation  Cores  Power
i486              1989    25 MHz          5               1             No           1      5W
Pentium           1993    66 MHz          5               2             No           1     10W
Pentium Pro       1997   200 MHz         10               3             Yes          1     29W
P4 Willamette     2001  2000 MHz         22               3             Yes          1     75W
P4 Prescott       2004  3600 MHz         31               3             Yes          1    103W
Core 2 Conroe     2006  2930 MHz         14               4             Yes          2     75W
Core 2 Yorkfield  2008  2930 MHz         16               4             Yes          4     95W
Core i7 Gulftown  2010  3460 MHz         16               4             Yes          6    130W
21
Streaming SIMD Extensions (SSE)
 SIMD: Single Instruction, Multiple Data
 A data-parallel architecture
 Both current AMD and Intel x86 processors have ISA
and micro-architecture support for SIMD operations
 MMX, 3DNow!, SSE, SSE2, SSE3, SSE4, AVX
 Many functional units
 8 128-bit vector registers: XMM0, XMM1, ..., XMM7
 See the flags field in /proc/cpuinfo

 SSE (Streaming SIMD Extensions): a SIMD instruction
set extension to the x86 architecture
 Instructions for operating on multiple data simultaneously (vector
operations): for (i=0; i<n; ++i) Z[i]=X[i]+Y[i];

 Programming SSE in C++: intrinsics
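
A minimal sketch of the Z[i] = X[i] + Y[i] loop above written with SSE
intrinsics (assuming n is a multiple of 4 and the arrays are 16-byte
aligned; the function name vec_add is invented for this example):

#include <xmmintrin.h>   /* SSE intrinsics: __m128 and _mm_* operations */

void vec_add(float *Z, const float *X, const float *Y, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 x = _mm_load_ps(&X[i]);          /* load 4 floats */
        __m128 y = _mm_load_ps(&Y[i]);
        _mm_store_ps(&Z[i], _mm_add_ps(x, y));  /* 4 adds in 1 instruction */
    }
}

Each _mm_add_ps performs four single-precision additions at once in one
XMM register, which is exactly the data-level parallelism SSE provides.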


22
Does Multiple Issue Work?
 Yes, but not as much as we'd like
 Programs have real dependencies that limit ILP
 Some dependencies are hard to eliminate
 e.g., pointer aliasing (see the sketch below)
 Some parallelism is hard to expose
 Limited window size during instruction issue
 Memory delays and limited bandwidth
 Hard to keep pipelines full
 Speculation can help if done well
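
A small C illustration of the pointer-aliasing problem (the function name
is invented for this example):

/* If a and b might point to the same word, the second statement depends
   on the first through memory, so neither the compiler nor the hardware
   may reorder or bundle the two updates without proving a != b. */
void bump_both(int *a, int *b) {
    *a = *a + 1;
    *b = *b + 1;
}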

23
Takeaway
 Pipelining is an important form of ILP
 The challenge is hazards
 Forwarding helps with many data hazards
 A delayed branch helps with the control hazard in the 5-stage pipeline
 A load delay slot / interlock is necessary

 More aggressive performance techniques:
 Longer pipelines
 VLIW
 Superscalar
 Out-of-order execution
 Speculation

 SSE?
24
