Parallel Processing: sp2016 Lec#3

The document discusses different types of parallel processing architectures, including implicit and explicit parallelism. It describes pipeline processors, superscalar processors, and VLIW processors as forms of implicit parallelism that exploit instruction level parallelism. It then covers explicit parallel architectures like SIMD and MIMD, explaining their classification based on instruction and data streams. Key aspects like programming models, performance limitations, and comparisons between SIMD and MIMD are summarized.


Parallel Processing

sp2016
lec#3
Dr M Shamim Baig


Implicit Parallel Architectures:


ILP processors
Pipelined Processor
Superscalar Processor
VLIW Processor


Pipeline Performance
Instruction & Arithmetic-unit Pipelines
Ideal pipeline speed-up: calculation & limits
Chained pipeline performance
The speed-up of a pipeline is ultimately limited by the
number of stages & the time of the slowest stage.
For this reason, conventional processors moved to very
deep pipelines (a 20-stage pipeline is an example of a deep
pipeline, compared to a normal pipeline of 3-6 stages).
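As a minimal sketch (not from the slides), the ideal speed-up of a k-stage pipeline over a non-pipelined unit can be computed as below; `pipeline_speedup` is a hypothetical helper name, and latch overheads and hazards are ignored:

```python
def pipeline_speedup(n_instructions, n_stages, stage_time):
    """Ideal speed-up of an n_stages pipeline over a non-pipelined unit.

    Non-pipelined: each instruction takes n_stages * stage_time.
    Pipelined: first result after n_stages cycles, then one per cycle.
    """
    unpipelined = n_instructions * n_stages * stage_time
    pipelined = (n_stages + n_instructions - 1) * stage_time
    return unpipelined / pipelined

# As n_instructions grows, the speed-up approaches n_stages (here, 5):
print(round(pipeline_speedup(1000, 5, 1.0), 2))  # 4.98
```

This makes the limit on the slide concrete: the asymptotic speed-up equals the number of stages, which is why designers were tempted by ever-deeper pipelines.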


Pipeline Performance Bottlenecks

A pipeline has the following performance bottlenecks:
Resource constraints
Data dependencies
Branch penalties
Approximately every 5th-6th instruction is a conditional jump! This
requires very accurate branch prediction.
The penalty of a misprediction grows with the depth of
the pipeline, since a larger number of in-flight instructions
have to be flushed.
Hence the need for better solutions than ever-deeper pipelines.
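A rough sketch of why the misprediction penalty matters, assuming the textbook simplification that a misprediction flushes about (depth - 1) cycles; `effective_cpi` and the rates used are illustrative, not from the slides:

```python
def effective_cpi(branch_freq, mispredict_rate, pipeline_depth):
    # Assumption: a misprediction flushes roughly the front of the
    # pipeline, costing about pipeline_depth - 1 cycles.
    penalty = pipeline_depth - 1
    return 1.0 + branch_freq * mispredict_rate * penalty

# ~1 in 5 instructions is a branch (the slide's "every 5-6th instruction"):
print(round(effective_cpi(0.2, 0.05, 5), 2))   # shallow pipeline: 1.04
print(round(effective_cpi(0.2, 0.05, 20), 2))  # deep pipeline:    1.19
```

Even with 95% prediction accuracy, the deep pipeline loses noticeably more throughput per misprediction, which motivates the multiple-short-pipeline (superscalar) approach that follows.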

Implicit Parallel Architectures:


ILP processors
Pipelined Processor
Superscalar Processor
VLIW Processor


Superscalar Processor
One simple way of alleviating the deep-pipeline
bottlenecks is to use multiple (concurrent) shorter
pipelines.
Issue multiple independent instructions
simultaneously.
Examples: MIPS R10000, PowerPC & Pentium.

The question then becomes one of selecting or
scheduling these instructions for simultaneous
issue.

Superscalar Scheduler
The superscalar scheduler is on-chip hardware that
looks at a number of instructions in an instruction
queue at runtime & selects an appropriate number
of instructions to execute concurrently.
Concurrent scheduling of instructions is
determined by a number of factors:
Resolve data-dependency issues
Resolve resource-constraint issues
Resolve branch-prediction issues

The cost/complexity of the scheduler hardware & its
performance constraints (discussed later) are
important issues in superscalar processors.


Example: two-way superscalar execution of instructions

[Figure: pipeline diagram of two-way superscalar execution; some instructions skip the OF (operand fetch) stage, others skip both OF & E (execute), leaving NA (not applicable) slots between the IF, ID & WB stages.]

Execution-unit constraints or data dependencies can cause additional delays relative to the ideal pipeline.

The example illustrates that different instruction mixes with
identical semantics can take significantly different execution times.

Superscalar Execution: Resource Waste

In the above example, there is some wastage of the execution-unit resource.

[Figure: the same two-way superscalar pipeline diagram, with the unused execution-unit slots highlighted.]

Superscalar Execution:
Efficiency Considerations
Not all functional units can be kept busy at all times.
If, during a cycle, no functional units are utilized, this is
referred to as vertical waste.
If, during a cycle, only some of the functional units are
utilized, this is referred to as horizontal waste.
Due to the limited parallelism in typical instruction traces
(dependencies) & the limited time/scope the scheduler has
to extract parallelism, the performance of superscalar
processors is eventually limited.
Conventional microprocessors typically support four-way superscalar execution.
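The two kinds of waste can be counted mechanically. A minimal sketch, assuming a schedule is given as a per-cycle count of busy functional units (`classify_waste` is a hypothetical helper, not from the slides):

```python
def classify_waste(schedule, n_units):
    """schedule: list of per-cycle counts of busy functional units.

    A cycle with zero busy units is vertical waste; a cycle with
    some, but not all, units busy shows horizontal waste.
    """
    vertical = sum(1 for busy in schedule if busy == 0)
    horizontal = sum(1 for busy in schedule if 0 < busy < n_units)
    return vertical, horizontal

# 4 functional units over 5 cycles:
v, h = classify_waste([4, 2, 0, 3, 4], n_units=4)
print(v, h)  # 1 vertically wasted cycle, 2 cycles with horizontal waste
```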

Superscalar Execution:
Instruction Issue Mechanisms
In the simpler model, instructions can be issued
only in the order in which they are encountered,
i.e. if the second instruction cannot be issued
because it has a data dependency on the first,
only one instruction is issued in that cycle.
This is called in-order issue.
In a more aggressive model, instructions can be
issued out of order. In this case, if the second
instruction has a data dependency on the first,
but the third instruction does not, the first and
third instructions can be co-scheduled.
This is also called dynamic issue.
The performance of in-order issue is generally limited
compared to dynamic issue.
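The difference between the two issue models can be sketched in a few lines. This is an illustrative simulation (names and representation are assumptions, not the slides' notation): each instruction is a (name, dependency-set) pair, and one issue cycle picks up to `width` ready instructions:

```python
def issue_cycle(queue, completed, width, in_order):
    """Pick up to `width` instructions whose dependencies are done.

    queue: list of (name, deps) pairs in program order;
    completed: set of names of already-finished instructions.
    """
    issued = []
    for name, deps in queue:
        if len(issued) == width:
            break
        if deps <= completed:
            issued.append(name)
        elif in_order:
            break  # in-order issue: stop at the first stalled instruction
    return issued

# i2 depends on i1; i3 is independent (the slide's scenario):
queue = [("i1", set()), ("i2", {"i1"}), ("i3", set())]
print(issue_cycle(queue, set(), width=2, in_order=True))   # ['i1']
print(issue_cycle(queue, set(), width=2, in_order=False))  # ['i1', 'i3']
```

The out-of-order (dynamic) scheduler co-schedules i1 and i3, exactly the case described above, while the in-order scheduler issues only one instruction that cycle.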

Implicit Parallel Architectures:


ILP processors
Pipelined Processor
Superscalar Processor
VLIW Processor


Very Long Instruction Word (VLIW) Processors

The hardware cost/complexity & the time/scope constraints
of the superscalar's runtime scheduling are the
major issues in superscalar design.

To address these issues, VLIW processors rely on
compile-time analysis to identify & bundle together
instructions that can be executed concurrently.
These instructions are packed & dispatched together,
hence the name very long instruction word.
Typical VLIW processors are limited to 4- to 8-way
parallelism. Variants of this concept are employed
in Intel IA-64 processors & TI TMS320C6xxx DSPs.
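A toy sketch of the compile-time bundling idea (an assumption-laden simplification: instructions are (dest, sources) register tuples, bundling is greedy and in program order, whereas a real VLIW compiler also reorders and unrolls):

```python
def bundle_vliw(instructions, width):
    """Greedily pack independent instructions into VLIW bundles.

    instructions: list of (dest, sources) register tuples in program
    order. An instruction may not join a bundle that already writes
    a register it reads or writes (a simple dependency test).
    """
    bundles, current, written = [], [], set()
    for dest, sources in instructions:
        dependent = dest in written or any(s in written for s in sources)
        if (dependent or len(current) == width) and current:
            bundles.append(current)        # close the current long word
            current, written = [], set()
        current.append((dest, sources))
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles

# r2 needs r1; r4 needs r2 & r3 -> three bundles on a 4-wide machine:
prog = [("r1", ()), ("r2", ("r1",)), ("r3", ()), ("r4", ("r2", "r3"))]
print(len(bundle_vliw(prog, width=4)))  # 3
```

All the dependency analysis happens before the program runs, which is exactly what removes the scheduler hardware from the chip.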


A high-performance DSP: 8-way VLIW processor

The TMS320C6x has dual data paths
& orthogonal instruction units,
which boost overall performance.

Comparison: Superscalar vs
Very Long Instruction Word (VLIW)
Superscalar implements the scheduler as on-chip hardware,
while VLIW implements it in compiler software.
Superscalar schedules concurrent instructions at runtime,
while VLIW does it at compile time.
The superscalar scheduler's scope is limited to a few instructions
from the instruction queue, while the VLIW scheduler has a bigger
context (possibly the full program) to process.
Due to more time & context, the VLIW scheduler can use
more powerful techniques (e.g. loop unrolling, static branch
prediction), giving better results, which a superscalar scheduler cannot afford.
Compilers, however, do not have runtime information (e.g.
cache misses, branch-variable state), so VLIW scheduling is
inherently more conservative than superscalar scheduling.
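As one concrete illustration of a compile-time technique mentioned above, here is loop unrolling, sketched in plain Python (the transformation itself is what matters; function names are illustrative). The four partial sums have no dependence on each other, so a VLIW compiler can schedule the four multiply-adds into one long instruction word:

```python
def dot(a, b):
    """Straightforward dot product: each += depends on the last."""
    s = 0.0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s

def dot_unrolled4(a, b):
    """Unrolled by 4: the four accumulators are independent."""
    s0 = s1 = s2 = s3 = 0.0
    n = len(a) - len(a) % 4
    for i in range(0, n, 4):
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
    for i in range(n, len(a)):  # leftover elements
        s0 += a[i] * b[i]
    return s0 + s1 + s2 + s3

x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(dot(x, x), dot_unrolled4(x, x))  # both 55.0
```

A runtime superscalar scheduler, looking at only a handful of queued instructions, cannot perform this whole-loop rewrite.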

Explicitly Parallel Processor Architectures:
Task-level Parallelism

Elements of (Explicit) Parallel Architectures

Processor configurations:
- Instruction/Data-Stream based
Memory configurations:
- Physical & Logical based
- Access-Delay based
Inter-processor communication:
- Communication-Interface design
- Data-Exchange/Synchronization approach

Flynn's Classification of
Parallel Processor Architectures
Classification based on instruction streams &
data streams (SISD, MISD, SIMD, MIMD).
Processing units in parallel computers either
operate under the centralized control of a
single control unit or work independently.
If there is a single control unit that dispatches
the same instruction to various processors
(which work on different data), the model is
referred to as single instruction stream,
multiple data stream (SIMD).
If each processor has its own control unit,
each processor can execute different
instructions on different data items. This model
is called multiple instruction stream, multiple
data stream (MIMD).

SIMD and MIMD Processors

[Figure: a typical SIMD architecture (a), in which a single instruction stream IS from one control unit drives n processing elements over data streams DS1...DSn; and a typical MIMD architecture (b), in which n control units issue independent instruction streams IS1...ISn to their own processors and data streams.]

SIMD Processors
Some of the earliest parallel computers, such as the
Illiac IV, MPP, DAP, CM-2 and MasPar MP-1, belonged to
this class of machines.
Variants of this concept have found use in co-processing
units such as the MMX units in Intel processors,
DSP chips such as the SHARC, & Nvidia's GPUs.
SIMD relies on the regular structure of computations (such
as those in image processing).
It is often necessary to selectively turn off operations on
certain data items. For this reason, most SIMD
programming paradigms allow for an "activity mask",
which determines whether a processor should participate in a
computation or not.

Example: Conditional Execution in SIMD Processors

Executing a conditional statement on a SIMD computer with four processors:
(a) the conditional statement; (b) the execution of the statement in two steps.
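The two-step execution can be sketched as follows. This is a sequential simulation of the activity-mask idea (`simd_where` and the example predicate are illustrative, not from the slides): all processors step through both branches in lockstep, and the mask decides who actually updates its value:

```python
def simd_where(cond, then_fn, else_fn, data):
    """Execute an if/else on a SIMD machine in two masked steps.

    cond: per-processor predicate values (the "activity mask").
    A masked-off processor idles and keeps its old value.
    """
    out = list(data)
    # Step 1: processors where cond is True run the then-branch.
    for i, active in enumerate(cond):
        if active:
            out[i] = then_fn(data[i])
    # Step 2: the mask is inverted; the rest run the else-branch.
    for i, active in enumerate(cond):
        if not active:
            out[i] = else_fn(data[i])
    return out

# Per-element conditional: if x is even then x // 2 else 3*x + 1
data = [4, 7, 10, 5]
mask = [x % 2 == 0 for x in data]
print(simd_where(mask, lambda x: x // 2, lambda x: 3 * x + 1, data))
# [2, 22, 5, 16]
```

Note the cost: both branches take a full step each, even if only one processor needs a branch, which is the price of the single shared instruction stream.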


Programming Models: MPMD/SPMD

In contrast to SIMD processors, MIMD processors can
execute different programs on different processors.
There are two programming models for parallel processing, called
Multiple-Program Multiple-Data (MPMD) & Single-Program Multiple-Data (SPMD),
which execute different/the same program on different processors.
SIMD supports only the SPMD model. Although MIMD
supports both models of programming (MPMD & SPMD),
SPMD is the preferred choice due to ease of software management.
Examples of MIMD platforms include current-generation
Sun Ultra servers, SGI Origin servers, multiprocessor
PCs, workstation clusters & the IBM SP.
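A minimal SPMD sketch, simulated sequentially (a real SPMD run would launch the same program on every processor, e.g. via MPI; here the per-processor `rank` is passed in explicitly, and all names are illustrative):

```python
def spmd_main(rank, size, data):
    """The single program every processor runs; behavior differs
    only through its rank (the SPMD model)."""
    chunk = data[rank::size]   # each rank takes a strided slice
    return sum(chunk)          # local partial result

# Simulate 4 MIMD processors all running the same program:
data = list(range(100))
partials = [spmd_main(r, 4, data) for r in range(4)]
print(sum(partials))  # 4950, the full reduction
```

One program text, four different data partitions: this is why SPMD is easier to manage than maintaining a separate program per processor (MPMD).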


Comparison: SIMD vs MIMD

Control flow:
synchronous in SIMD vs asynchronous in MIMD.
Programming model: SIMD supports only the SPMD model,
while MIMD supports both (SPMD & MPMD) models.
Cost: SIMD computers require less hardware than
MIMD computers (a single control unit).
However, since SIMD processors are specially
designed, they tend to be expensive and have long
design cycles.
In contrast, MIMD processors can be built from
inexpensive off-the-shelf components with relatively
little effort in a short time.
Flexibility: SIMD performs very well for specialized/
regular applications but not for all applications, while
MIMD is more flexible & general-purpose.
