Pipelining and Vector Processing

The document discusses the types and levels of parallel processing, including job-level, task-level, inter-instruction, and intra-instruction parallelism. It also describes Flynn's classification of parallel architectures based on the number of instruction and data streams. Common parallel architectures include shared-memory multiprocessors, message-passing multicomputers, array processors, and systolic arrays. Pipelining is introduced as a technique for decomposing a sequential process into overlapping suboperations to improve computational speed. Challenges with pipelining, such as structural hazards, data hazards, and control hazards, are also summarized.


Parallel Processing

Execution of concurrent events in the computing process to achieve faster computational speed.

Levels of Parallel Processing

- Job or Program level
- Task or Procedure level
- Inter-Instruction level
- Intra-Instruction level
Flynn's Classification

Based on the multiplicity of instruction streams and data streams.

- Instruction Stream: the sequence of instructions read from memory
- Data Stream: the operations performed on the data in the processor

Architectural classification:

                                        Number of Data Streams
                                        Single      Multiple
  Number of Instruction      Single     SISD        SIMD
  Streams                    Multiple   MISD        MIMD
Computer Architectures for Parallel Processing

- Von-Neumann based
  - SISD: Superscalar processors, Superpipelined processors, VLIW
  - MISD: Nonexistence
  - SIMD: Array processors, Systolic arrays, Associative processors
  - MIMD
    - Shared-memory multiprocessors: Bus based, Crossbar switch based, Multistage IN based
    - Message-passing multicomputers: Hypercube, Mesh, Reconfigurable
- Dataflow
- Reduction
SISD Computer System

[Diagram: Control Unit, Processor Unit, and Memory connected by a single instruction stream and a single data stream]

Characteristics
- Standard von Neumann machine
- Instructions and data are stored in memory
- One operation at a time

Limitations
- Von Neumann bottleneck: the maximum speed of the system is limited by the memory bandwidth (bits/sec or bytes/sec)
- Limitation on memory bandwidth
- Memory is shared by CPU and I/O
MISD Computer System

[Diagram: multiple control unit / processor pairs (M, CU, P) receive separate instruction streams but operate on a single data stream from memory]

Characteristics
- There is no computer at present that can be classified as MISD
SIMD Computer System

[Diagram: a control unit issues a single instruction stream to an array of processor units, which access memory modules through an alignment network and a data bus]

Characteristics
- Only one copy of the program exists
- A single controller executes one instruction at a time
Types of SIMD Computers

Array Processors
- The control unit broadcasts instructions to all PEs, and all active PEs execute the same instructions
- Examples: ILLIAC IV, GF-11, Connection Machine, DAP, MPP

Systolic Arrays
- Regular arrangement of a large number of very simple processors constructed on VLSI circuits
- Examples: CMU Warp, Purdue CHiP

Associative Processors
- Content addressing
- Data transformation operations over many sets of arguments with a single instruction
- Examples: STARAN, PEPE
MIMD Computer System

[Diagram: processor-memory (P, M) pairs connected through an interconnection network to a shared memory]

Characteristics
- Multiple processing units
- Execution of multiple instructions on multiple data

Types of MIMD computer systems
- Shared-memory multiprocessors
- Message-passing multicomputers
Pipelining

A technique of decomposing a sequential process into suboperations, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments.

Example:  Ai * Bi + Ci   for i = 1, 2, 3, ..., 7

Suboperations performed in each segment:

  Segment 1:  R1 <- Ai,  R2 <- Bi        Load Ai and Bi
  Segment 2:  R3 <- R1 * R2,  R4 <- Ci   Multiply and load Ci
  Segment 3:  R5 <- R3 + R4              Add

[Diagram: Ai and Bi are read from memory into R1 and R2; a multiplier produces R3 while Ci is loaded into R4; an adder produces R5]

Content of registers in the pipeline:

  Clock Pulse   Segment 1      Segment 2            Segment 3
  Number        R1    R2       R3          R4       R5
  1             A1    B1
  2             A2    B2       A1 * B1     C1
  3             A3    B3       A2 * B2     C2       A1 * B1 + C1
  4             A4    B4       A3 * B3     C3       A2 * B2 + C2
  5             A5    B5       A4 * B4     C4       A3 * B3 + C3
  6             A6    B6       A5 * B5     C5       A4 * B4 + C4
  7             A7    B7       A6 * B6     C6       A5 * B5 + C5
  8                            A7 * B7     C7       A6 * B6 + C6
  9                                                 A7 * B7 + C7
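
The table can be reproduced with a short simulation. The sketch below is an illustration added here, not part of the original slides: Python variables stand in for R1-R5, and one result emerges per clock pulse once the pipeline is full.

  # Simulate the three-segment pipeline computing Ai * Bi + Ci.
  A = [1, 2, 3, 4, 5, 6, 7]
  B = [10, 20, 30, 40, 50, 60, 70]
  C = [100, 200, 300, 400, 500, 600, 700]

  R1 = R2 = R3 = R4 = R5 = None
  results = []

  for clock in range(len(A) + 2):          # k + n - 1 = 3 + 7 - 1 = 9 pulses
      # Segment 3: add, using values latched by segment 2 on the previous clock
      R5 = R3 + R4 if R3 is not None else None
      # Segment 2: multiply and load Ci, using values latched by segment 1
      R3, R4 = (R1 * R2, C[clock - 1]) if R1 is not None else (None, None)
      # Segment 1: load Ai and Bi while there are tasks left
      R1, R2 = (A[clock], B[clock]) if clock < len(A) else (None, None)
      if R5 is not None:
          results.append(R5)

  assert results == [a * b + c for a, b, c in zip(A, B, C)]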
General Structure of a 4-Segment Pipeline

[Diagram: Input -> S1 -> R1 -> S2 -> R2 -> S3 -> R3 -> S4 -> R4, with a common clock driving all segment registers]

Space-Time Diagram

  Clock cycles   1    2    3    4    5    6    7    8    9
  Segment 1      T1   T2   T3   T4   T5   T6
  Segment 2           T1   T2   T3   T4   T5   T6
  Segment 3                T1   T2   T3   T4   T5   T6
  Segment 4                     T1   T2   T3   T4   T5   T6
Pipeline Speedup

n: number of tasks to be performed

Conventional machine (non-pipelined)
  tn: clock cycle
  T1: time required to complete the n tasks
  T1 = n * tn

Pipelined machine (k segments)
  tp: clock cycle (time to complete each suboperation)
  Tk: time required to complete the n tasks
  Tk = (k + n - 1) * tp

Speedup
  Sk = n * tn / ((k + n - 1) * tp)

  As n -> infinity:  Sk -> tn / tp   (= k, if tn = k * tp)
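
The formulas translate directly into a small helper (a sketch added for illustration; the function name is an assumption):

  def pipeline_speedup(n_tasks: int, k_segments: int, tn: float, tp: float) -> float:
      t_nonpipelined = n_tasks * tn                      # T1 = n * tn
      t_pipelined = (k_segments + n_tasks - 1) * tp      # Tk = (k + n - 1) * tp
      return t_nonpipelined / t_pipelined                # Sk

  # As n grows, Sk approaches tn / tp; if tn = k * tp, the limit is k.
  print(pipeline_speedup(100, 4, tn=80, tp=20))      # 3.88... (the example below)
  print(pipeline_speedup(10_000, 4, tn=80, tp=20))   # close to 4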
Multiple Functional Units

[Diagram: four identical functional units P1-P4 operating in parallel on instructions Ii, Ii+1, Ii+2, Ii+3]

Example
- 4-stage pipeline
- suboperation in each stage: tp = 20 ns
- 100 tasks to be executed
- 1 task in the non-pipelined system: 20 * 4 = 80 ns

Pipelined system:
  (k + n - 1) * tp = (4 + 99) * 20 = 2060 ns

Non-pipelined system:
  n * k * tp = 100 * 80 = 8000 ns

Speedup:
  Sk = 8000 / 2060 = 3.88

A 4-stage pipeline is basically identical to a system with 4 identical functional units.
Instruction Pipeline

Six phases* in an instruction cycle:
[1] Fetch an instruction from memory
[2] Decode the instruction
[3] Calculate the effective address of the operand
[4] Fetch the operands from memory
[5] Execute the operation
[6] Store the result in the proper place

* Some instructions skip some phases
* Effective address calculation can be done as part of the decoding phase
* Storage of the operation result into a register is done automatically in the execution phase

=> 4-Stage Instruction Pipeline
[1] FI: Fetch an instruction from memory
[2] DA: Decode the instruction and calculate the effective address of the operand
[3] FO: Fetch the operand
[4] EX: Execute the operation
Execution of Three Instructions in a 4-Stage Pipeline

Conventional (no overlap):

  i      FI  DA  FO  EX
  i+1                    FI  DA  FO  EX
  i+2                                    FI  DA  FO  EX

Pipelined:

  i      FI  DA  FO  EX
  i+1        FI  DA  FO  EX
  i+2            FI  DA  FO  EX

Timing of the instruction pipeline when instruction 3 is a branch:

  Step:             1   2   3   4   5   6   7   8   9   10  11  12  13
  Instruction 1     FI  DA  FO  EX
              2         FI  DA  FO  EX
   (Branch)   3             FI  DA  FO  EX
              4                 FI  -   -   FI  DA  FO  EX
              5                     -   -   -   FI  DA  FO  EX
              6                                 FI  DA  FO  EX
              7                                     FI  DA  FO  EX
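
The overlapped timing can be generated mechanically. The short sketch below (added for illustration, not from the slides) prints the ideal space-time diagram of the FI/DA/FO/EX pipeline for any number of instructions, ignoring branches and stalls.

  STAGES = ["FI", "DA", "FO", "EX"]

  def space_time_diagram(n_instructions: int) -> None:
      total_cycles = len(STAGES) + n_instructions - 1      # k + n - 1
      for i in range(n_instructions):
          row = ["    "] * total_cycles
          for s, name in enumerate(STAGES):
              row[i + s] = f"{name:<4}"                     # stage s occurs in cycle i + s
          print(f"instr {i + 1}: " + "".join(row))

  space_time_diagram(3)
  # instr 1: FI  DA  FO  EX
  # instr 2:     FI  DA  FO  EX
  # instr 3:         FI  DA  FO  EX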
Four-Segment Instruction Pipeline (flowchart)

  Segment 1: Fetch instruction from memory
  Segment 2: Decode instruction and calculate effective address
             Branch?  yes -> update PC, empty the pipe, and fetch again
                      no  -> continue
  Segment 3: Fetch operand from memory
  Segment 4: Execute instruction
             Interrupt?  yes -> interrupt handling, update PC, empty the pipe
                         no  -> fetch the next instruction
Major Hazards in Pipelined Execution

Structural hazards (resource conflicts)
- Hardware resources required by the instructions in simultaneous overlapped execution cannot be met

Data hazards (data dependency conflicts)
- An instruction scheduled to be executed in the pipeline requires the result of a previous instruction, which is not yet available

  Example (data dependency):
    R1 <- B + C      ADD   DA   B,C   +
    R1 <- R1 + 1     INC   DA   +1    R1    bubble

Control hazards
- Branches and other instructions that change the PC delay the fetch of the next instruction

  Example (branch address dependency):
    JMP      ID   PC+   PC
    bubble   IF   ID    OF   OE   OS

Hazards in a pipeline may make it necessary to stall the pipeline.
Pipeline interlock: detect the hazard and stall until it is cleared.
Structural Hazards

Occur when some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute.

Example: with one memory port, a data fetch and an instruction fetch cannot be initiated in the same clock cycle. The pipeline is stalled for the structural hazard.

  i      FI  DA  FO  EX
  i+1        FI  DA  FO  EX
  i+2            stall  stall  FI  DA  FO  EX

<- Two loads with one port memory
-> A two-port memory will serve without stall
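
A minimal way to express the conflict test (an added sketch, assuming the slide's single memory port):

  def memory_port_conflict(stages_in_cycle):
      """stages_in_cycle: pipeline stages active in one clock cycle."""
      memory_users = [s for s in stages_in_cycle if s in ("FI", "FO")]
      return len(memory_users) > 1          # more memory requests than ports

  # Cycle 3 of the diagram above: i is in FO, i+1 in DA, i+2 wants FI.
  print(memory_port_conflict(["FO", "DA", "FI"]))   # True -> i+2's FI must stall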
Data Hazards

Occur when the execution of an instruction depends on the results of a previous instruction.

  ADD  R1, R2, R3
  SUB  R4, R1, R5

Hardware techniques

Interlock
- Hardware detects the data dependencies and delays the scheduling of the dependent instruction by stalling enough clock cycles

Forwarding (bypassing, short-circuiting)
- Accomplished by a data path that routes a value from a source (usually an ALU) to a user, bypassing a designated register. This allows the value to be used at an earlier stage in the pipeline than would otherwise be possible.

Software technique
- Instruction scheduling by the compiler (delayed load)

Data hazards can be dealt with by either hardware or software techniques.
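
As an illustration of the interlock idea (a sketch added here; the instruction encoding is an assumption), hardware-style read-after-write detection between adjacent instructions can be expressed as:

  from typing import NamedTuple

  class Instr(NamedTuple):
      op: str
      dest: str
      srcs: tuple

  program = [
      Instr("ADD", "R1", ("R2", "R3")),   # R1 <- R2 + R3
      Instr("SUB", "R4", ("R1", "R5")),   # needs R1 from the previous instruction
  ]

  def needs_stall(prev: Instr, curr: Instr) -> bool:
      # RAW hazard: the current instruction reads a register the previous one writes
      return prev.dest in curr.srcs

  for earlier, later in zip(program, program[1:]):
      if needs_stall(earlier, later):
          print(f"stall: {later.op} waits for {earlier.op} to write {earlier.dest}")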
[Figure: forwarding hardware — the ALU result buffer is fed back through a MUX to the ALU input over a bypass path, in addition to the normal result write bus into the register file (e.g., R4)]
Forwarding Example

  ADD  R1, R2, R3
  SUB  R4, R1, R5

3-stage pipeline:
  I: Instruction fetch
  A: Decode, read registers, ALU operations
  E: Write the result to the destination register

Without bypassing, SUB's A stage must wait until ADD's E stage has written R1:

  ADD   I   A   E
  SUB       I   -   A   E

With bypassing, ADD's ALU result is routed directly to SUB's ALU input:

  ADD   I   A   E
  SUB       I   A   E

Delayed Load

A load requiring that the following instruction not use its result.

  a = b + c;
  d = e - f;

  Unscheduled code:          Scheduled code:
    LW   Rb, b                 LW   Rb, b
    LW   Rc, c                 LW   Rc, c
    ADD  Ra, Rb, Rc            LW   Re, e
    SW   a, Ra                 ADD  Ra, Rb, Rc
    LW   Re, e                 LW   Rf, f
    LW   Rf, f                 SW   a, Ra
    SUB  Rd, Re, Rf            SUB  Rd, Re, Rf
    SW   d, Rd                 SW   d, Rd
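
A rough sketch of the compiler's task (added for illustration; a real scheduler reorders independent instructions as in the scheduled code above, while this simplified pass just inserts a NOP into the load delay slot):

  def schedule_delayed_loads(code):
      """code: list of (opcode, dest, sources) tuples."""
      scheduled = []
      for instr in code:
          if scheduled:
              prev_op, prev_dest, _ = scheduled[-1]
              _, _, srcs = instr
              if prev_op == "LW" and prev_dest in srcs:
                  scheduled.append(("NOP", None, ()))   # fill the load delay slot
          scheduled.append(instr)
      return scheduled

  unscheduled = [
      ("LW",  "Rb", ()),            # LW  Rb, b
      ("LW",  "Rc", ()),            # LW  Rc, c
      ("ADD", "Ra", ("Rb", "Rc")),  # uses Rc right after its load
  ]
  print(schedule_delayed_loads(unscheduled))   # a NOP is inserted before the ADD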
Control Hazards: Branch Instructions

- The branch target address is not known until the branch instruction is completed
- Stall -> waste of cycle times

  Branch instruction   FI  DA  FO  EX
  Next instruction                     FI  DA  FO  EX
                       (target address available after EX)

Dealing with control hazards:
* Prefetch the target instruction
* Branch target buffer
* Loop buffer
* Branch prediction
* Delayed branch
RISC Pipeline

RISC: a machine with a very fast clock cycle that executes at the rate of one instruction per cycle
  <- Simple instruction set
     Fixed-length instruction format
     Register-to-register operations

Instruction Cycles of the Three-Stage Instruction Pipeline

Data manipulation instructions
  I: Instruction fetch
  A: Decode, read registers, ALU operations
  E: Write a register

Load and store instructions
  I: Instruction fetch
  A: Decode, evaluate effective address
  E: Register-to-memory or memory-to-register

Program control instructions
  I: Instruction fetch
  A: Decode, evaluate branch address
  E: Write register (PC)
Three-Segment Pipeline Timing

Example program:
  LOAD:  R1 <- M[address 1]
  LOAD:  R2 <- M[address 2]
  ADD:   R3 <- R1 + R2
  STORE: M[address 3] <- R3

Pipeline timing with data conflict:

  clock cycle    1   2   3   4   5   6
  Load R1        I   A   E
  Load R2            I   A   E
  Add R1+R2              I   A   E
  Store R3                   I   A   E

Pipeline timing with delayed load:

  clock cycle    1   2   3   4   5   6   7
  Load R1        I   A   E
  Load R2            I   A   E
  NOP                    I   A   E
  Add R1+R2                  I   A   E
  Store R3                       I   A   E

The data dependency is taken care of by the compiler rather than by the hardware.
Delayed Branch

The compiler analyzes the instructions before and after the branch and rearranges the program sequence by inserting useful instructions in the delay steps.

Using no-operation instructions:

  Clock cycles:    1   2   3   4   5   6   7   8   9   10
  1. Load          I   A   E
  2. Increment         I   A   E
  3. Add                   I   A   E
  4. Subtract                  I   A   E
  5. Branch to X                   I   A   E
  6. NOP                               I   A   E
  7. NOP                                   I   A   E
  8. Instr. in X                               I   A   E

Rearranging the instructions:

  Clock cycles:    1   2   3   4   5   6   7   8
  1. Load          I   A   E
  2. Increment         I   A   E
  3. Branch to X           I   A   E
  4. Add                       I   A   E
  5. Subtract                      I   A   E
  6. Instr. in X                       I   A   E
Vector Processing

Vector Processing Applications

Problems that can be efficiently formulated in terms of vectors:
- Long-range weather forecasting
- Petroleum explorations
- Seismic data analysis
- Medical diagnosis
- Aerodynamics and space flight simulations
- Artificial intelligence and expert systems
- Mapping the human genome
- Image processing

Vector Processor (computer)
- Ability to process vectors, and related data structures such as matrices and multi-dimensional arrays, much faster than conventional computers
- Vector processors may also be pipelined
Vector Programming

  DO 20 I = 1, 100
  20    C(I) = B(I) + A(I)

Conventional (scalar) computer:

  Initialize I = 0
  20  Read A(I)
      Read B(I)
      Store C(I) = A(I) + B(I)
      Increment I = I + 1
      If I <= 100 go to 20

Vector computer:

  C(1:100) = A(1:100) + B(1:100)
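
The two styles can be contrasted in software as well. The sketch below is added for illustration only; NumPy's whole-array addition stands in as an analogy for the single vector instruction C(1:100) = A(1:100) + B(1:100).

  import numpy as np

  A = np.arange(100, dtype=np.float64)
  B = np.arange(100, dtype=np.float64)

  # Conventional (scalar) style: one element per loop iteration
  C_scalar = np.empty(100)
  for i in range(100):
      C_scalar[i] = A[i] + B[i]

  # Vector style: one operation over all 100 elements
  C_vector = A + B

  assert np.array_equal(C_scalar, C_vector)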
Vector Instructions

  f1: V -> V        f3: V x V -> V        V: vector operand
  f2: V -> S        f4: V x S -> V        S: scalar operand

  Type  Mnemonic  Description (I = 1, ..., n)
  f1    VSQR      Vector square root      B(I) <- SQR(A(I))
        VSIN      Vector sine             B(I) <- sin(A(I))
        VCOM      Vector complement       A(I) <- NOT A(I)
  f2    VSUM      Vector summation        S <- Σ A(I)
        VMAX      Vector maximum          S <- max{A(I)}
  f3    VADD      Vector add              C(I) <- A(I) + B(I)
        VMPY      Vector multiply         C(I) <- A(I) * B(I)
        VAND      Vector AND              C(I) <- A(I) AND B(I)
        VLAR      Vector larger           C(I) <- max(A(I), B(I))
        VTGE      Vector test >=          C(I) <- 0 if A(I) < B(I); C(I) <- 1 if A(I) >= B(I)
  f4    SADD      Vector-scalar add       B(I) <- S + A(I)
        SDIV      Vector-scalar divide    B(I) <- A(I) / S
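
For illustration only, a few of the mnemonics above mapped to whole-array operations (the Python bindings are assumptions; NumPy again stands in for the vector hardware):

  import numpy as np

  VECTOR_OPS = {
      "VSQR": np.sqrt,        # f1: B(I) <- SQR(A(I))
      "VSIN": np.sin,         # f1: B(I) <- sin(A(I))
      "VSUM": np.sum,         # f2: S    <- sum of A(I)
      "VMAX": np.max,         # f2: S    <- max{A(I)}
      "VADD": np.add,         # f3: C(I) <- A(I) + B(I)
      "VMPY": np.multiply,    # f3: C(I) <- A(I) * B(I)
      "VLAR": np.maximum,     # f3: C(I) <- max(A(I), B(I))
  }

  A = np.array([1.0, 4.0, 9.0])
  B = np.array([2.0, 2.0, 2.0])
  print(VECTOR_OPS["VSQR"](A))       # [1. 2. 3.]
  print(VECTOR_OPS["VADD"](A, B))    # [ 3.  6. 11.]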
Vector Instruction Format

  | Operation code | Base address source 1 | Base address source 2 | Base address destination | Vector length |
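
Viewed as a data structure, the format is a record with one field per box (a sketch; the class and field names are assumptions, not defined in the slides):

  from dataclasses import dataclass

  @dataclass
  class VectorInstruction:
      opcode: str              # e.g. "VADD"
      base_src1: int           # base address of the first source vector
      base_src2: int           # base address of the second source vector
      base_dst: int            # base address of the destination vector
      length: int              # number of elements to process

  # C(1:100) = A(1:100) + B(1:100) expressed as a single vector instruction:
  vadd = VectorInstruction("VADD", base_src1=0x1000, base_src2=0x2000,
                           base_dst=0x3000, length=100)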
Pipeline for Inner Product

[Figure: Source A and Source B feed a multiplier pipeline whose products feed an adder pipeline]
"ultiple "odule "emor(
Address Interleaving

#ifferent sets of addresses are assigned to
different memor( modules
A&
"emor(
arra(
#&
A&
"emor(
arra(
#&
A&
"emor(
arra(
#&
A&
"emor(
arra(
#&
Address bus
#ata bus
"D "4 "5 "6
