Parallel Processing: sp2016 Lec#3
Dr M Shamim Baig
Pipeline Performance
Instruction & Arithmetic-unit Pipeline
Ideal pipeline Speed-up calculation & Limits
Chained Pipeline Performance
The speed-up of a pipeline is ultimately limited by the
number of stages and the time of the slowest stage.
For this reason (to shorten the slowest stage by splitting
work into more, smaller stages), conventional processors
relied on very deep pipelines (a 20-stage pipeline is an
example of a deep pipeline, compared to a normal pipeline
of 3-6 stages).
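A minimal sketch of the ideal speed-up limit, assuming k stages of equal latency t and n instructions (the names k, n, t are illustrative, not from the slides):

# Ideal pipeline speed-up: sequential time / pipelined time,
# assuming k equal stages of latency t and n instructions.
def ideal_speedup(k, n, t=1.0):
    sequential = n * k * t        # each instruction takes all k stages
    pipelined = (k + n - 1) * t   # fill the pipe, then one result per cycle
    return sequential / pipelined

# As n grows, speed-up approaches k (the number of stages):
for n in (1, 10, 100, 10_000):
    print(n, round(ideal_speedup(k=5, n=n), 3))  # -> 1.0, 3.571, 4.808, 4.998

This shows why deeper pipelines are attractive: the asymptotic speed-up equals the stage count k, provided no stalls occur.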
Superscalar Processor
One simple way of alleviating the deep pipeline
bottlenecks is to use multiple (concurrent) short
pipelines.
Issue multiple independent instructions
simultaneously
Examples: MIPS R10000, PowerPC & Pentium
Superscalar Scheduler
The superscalar scheduler is on-chip hardware that
looks at a number of instructions in an instruction
queue at runtime & selects an appropriate number
of instructions to execute concurrently.
Scheduling of instructions concurrently is
determined by a number of factors:
Resolve Data Dependency Issues
Resolve Resource Constraint Issues
Resolve Branch Prediction Issues
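A minimal sketch of the first two checks for a 2-wide issue window; the instruction encoding (unit, destination register, source registers) is an illustrative assumption, not a real ISA:

# Toy 2-way issue check: co-issue two instructions only if the second
# does not read or overwrite the first's result (data dependency) and
# they need different functional units (resource constraint).
def can_dual_issue(i1, i2):
    unit1, dest1, _ = i1
    unit2, dest2, srcs2 = i2
    data_dep = dest1 in srcs2 or dest1 == dest2   # RAW or WAW hazard
    resource_conflict = unit1 == unit2            # one unit of each kind
    return not (data_dep or resource_conflict)

add = ("alu", "r1", ("r2", "r3"))   # r1 = r2 + r3
mul = ("mul", "r4", ("r1", "r5"))   # r4 = r1 * r5 (reads r1 -> dependent)
ld  = ("mem", "r6", ("r7",))        # r6 = load [r7] (independent)

print(can_dual_issue(add, mul))  # False: RAW hazard on r1
print(can_dual_issue(add, ld))   # True: independent, different units

Branch prediction is the third factor; a real scheduler also speculates past conditional branches, which this sketch omits.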
[Superscalar pipeline diagrams: stages IF, ID, NA, NA, WB (OF not required). An execution-unit constraint or a data dependency can cause additional delays beyond the ideal pipeline.]
Superscalar Execution:
Efficiency Considerations
Not all functional units can be kept busy at all times.
If during a cycle, no functional units are utilized, this is
referred to as vertical waste.
If during a cycle, only some of the functional units are
utilized, this is referred to as horizontal waste.
Due to limited parallelism in typical instruction traces
(dependencies) & the limited time/scope the scheduler has
to extract parallelism, the performance of superscalar
processors is ultimately limited.
Conventional microprocessors typically support four-way superscalar execution.
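A minimal sketch of the two waste measures, assuming a cycle-by-cycle utilization table for a four-way machine (the table contents are made up for illustration):

# Vertical waste: cycles in which no functional unit is busy.
# Horizontal waste: idle unit-slots in cycles where some units are busy.
# Each row is one cycle; an entry is 1 if that unit was used.
schedule = [
    [1, 1, 0, 0],   # partially used -> 2 slots of horizontal waste
    [0, 0, 0, 0],   # nothing issued -> 1 cycle of vertical waste
    [1, 1, 1, 1],   # fully utilized -> no waste
]

vertical = sum(1 for cycle in schedule if not any(cycle))
horizontal = sum(cycle.count(0) for cycle in schedule if any(cycle))
print(vertical, horizontal)  # 1 cycle vertical, 2 slots horizontal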
Superscalar Execution:
Instruction Issue Mechanisms
In the simpler model, instructions can be issued
only in the order in which they are encountered,
i.e., if the second instruction cannot be issued
because it has a data dependency with the first,
only one instruction is issued in the cycle.
This is called in-order issue.
In a more aggressive model, instructions can be
issued out of order. In this case, if the second
instruction has data dependencies with the first,
but the third instruction does not, the first and
third instructions can be co-scheduled.
This is also called dynamic issue.
Performance of in-order issue is generally more limited
than that of dynamic issue.
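A minimal sketch contrasting the two issue policies on the three-instruction example above; the dependency encoding is an illustrative assumption:

# Each instruction is (name, set of instructions it depends on).
instrs = [("i1", set()), ("i2", {"i1"}), ("i3", set())]

def issue(instrs, width=2, in_order=True):
    if in_order:
        issued = []
        for name, deps in instrs:
            if deps or len(issued) == width:
                break          # a stalled instruction blocks all later ones
            issued.append(name)
        return issued
    # dynamic (out-of-order) issue: pick any ready instructions
    ready = [name for name, deps in instrs if not deps]
    return ready[:width]

print(issue(instrs, in_order=True))   # ['i1']       -- i2 blocks i3
print(issue(instrs, in_order=False))  # ['i1', 'i3'] -- co-scheduled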
Comparison: Superscalar vs
Very Long Instruction Word (VLIW)
Superscalar implements the scheduler as on-chip hardware,
while VLIW implements it in compiler software.
Superscalar schedules concurrent instructions at runtime,
while VLIW does it at compile time.
The superscalar scheduler's scope is limited to a few
instructions from the instruction queue, while the VLIW
scheduler has a bigger context (possibly the full program)
to process.
Due to more time & context, the VLIW scheduler can use
more powerful algorithms (e.g. loop unrolling, branch
prediction, etc.), giving better results, which superscalar
cannot afford (loop unrolling is sketched below).
Compilers, however, do not have runtime information (e.g.
cache misses, branch variable state, etc.), so VLIW
scheduling is inherently more conservative than superscalar.
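A minimal sketch of loop unrolling, one of the compile-time transformations mentioned above; Python stands in for the compiler's machine code here, and the 4x unroll factor and names are illustrative assumptions:

# Original loop: one multiply per iteration, serialized by loop control.
def scale(a, c):
    for i in range(len(a)):
        a[i] *= c

# Unrolled 4x: four independent multiplies per iteration that a VLIW
# compiler could pack into one wide instruction word
# (assumes len(a) is a multiple of 4).
def scale_unrolled(a, c):
    for i in range(0, len(a), 4):
        a[i]     *= c
        a[i + 1] *= c
        a[i + 2] *= c
        a[i + 3] *= c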
[Architecture diagram: instruction streams (IS1 ... ISn) and data streams (DS1 ... DSn) connecting processing units to MEMORY.]
SIMD Processors
Some of the earliest parallel computers such as the
Illiac IV, MPP, DAP, CM-2, and MasPar MP-1 belonged to
this class of machines.
Variants of this concept have found use in co-processing
units such as the MMX units in Intel processors, DSP
chips such as the Sharc & Nvidia's GPUs.
SIMD relies on the regular structure of computations (such
as those in image processing).
It is often necessary to selectively turn off operations on
certain data items. For this reason, most SIMD
programming paradigms allow for an "activity mask",
which determines whether a processor should participate
in a computation or not.
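A minimal sketch of the activity-mask idea using NumPy's boolean masking (NumPy stands in for SIMD hardware here; the condition and data are illustrative assumptions):

import numpy as np

# All "processors" receive the same instruction (add 1), but the
# activity mask turns it off for elements that fail the condition.
data = np.array([3, -1, 4, -5, 9])
mask = data > 0                        # activity mask: participate or not
data = np.where(mask, data + 1, data)  # masked-off elements keep old values
print(data)                            # -> [ 4 -1  5 -5 10]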