DATA LEVEL PARALLELISM IN
SIMD, VECTOR AND GPU
BY
19PW40
S.Sayana
DATA PARALLELISM
Data level parallelism (also
known as loop level parallelism)
is a form of parallel computing in
which the data is distributed
across different parallel processor
nodes, each of which operates on
its own portion of the data.
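As a minimal sketch of what distributing the data can look like, the C fragment below splits n elements into contiguous chunks, one per worker, and lets each worker apply the same operation to its own chunk. The names `chunk_bounds` and `scale_chunk` and the block-wise split are assumptions made for illustration, not a scheme prescribed by the slides.

```c
#include <stddef.h>

/* Hypothetical helper: computes the half-open index range [*lo, *hi) that
   worker `rank` (0-based) owns when n elements are split across p workers.
   The first n % p workers receive one extra element. Assumes p > 0. */
void chunk_bounds(size_t n, size_t p, size_t rank, size_t *lo, size_t *hi)
{
    size_t base = n / p;     /* minimum number of elements per worker */
    size_t extra = n % p;    /* leftover elements to spread out */
    *lo = rank * base + (rank < extra ? rank : extra);
    *hi = *lo + base + (rank < extra ? 1 : 0);
}

/* Every worker runs the same operation on its own slice of the data. */
void scale_chunk(double *data, size_t lo, size_t hi, double factor)
{
    for (size_t i = lo; i < hi; i++)
        data[i] *= factor;
}
```

With 10 elements and 3 workers, the workers would own the index ranges [0, 4), [4, 7) and [7, 10).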
LOOP LEVEL PARALLELISM
Loop level parallelism is a form of parallelism
that is concerned with extracting parallel
tasks from loops.
It often arises in programs where data is
stored in random-access data structures.
It uses multiple processes that iterate over
the data structure and operate on some or
all of the indices at the same time.
This reduces the overall execution time
of the program.
LOOP LEVEL PARALLELISM
The simplest and most common way
to increase the amount of parallelism
available among instructions is to
exploit parallelism among iterations
of a loop. This type of parallelism is
often called loop level parallelism.
DEPENDENCE IN LOOPS
Example 1:
for (i=1; i<=1000; i=i+1)
    x[i] = x[i] + y[i];
This is a parallel loop. Every iteration of
the loop can overlap with any other
iteration, although within each loop
iteration there is little opportunity for
overlap.
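To make the independence of the iterations concrete, here is a small sketch in which the 1000 iterations of Example 1 are split between two hypothetical workers. The function names and the two-way split are assumptions for illustration. The halves are run one after the other, in reverse order, simply to show that the result does not depend on the order in which iterations execute.

```c
/* Example 1's loop body applied to the inclusive index range lo..hi. */
void add_range(double *x, const double *y, int lo, int hi)
{
    for (int i = lo; i <= hi; i = i + 1)
        x[i] = x[i] + y[i];
}

/* Hypothetical split: two "workers" each take half of the 1000 iterations.
   They run sequentially here, second half first; because no iteration reads
   a value written by another, they could equally run at the same time.
   Arrays are indexed 1..1000, so they need at least 1001 elements. */
void add_in_two_halves(double *x, const double *y)
{
    add_range(x, y, 501, 1000);  /* worker B */
    add_range(x, y, 1, 500);     /* worker A */
}
```

Calling `add_in_two_halves(x, y)` leaves `x` in the same state as the original sequential loop.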
Example 2:
for (i=1; i<=100; i=i+1) {
    a[i] = a[i] + b[i];     //S1
    b[i+1] = c[i] + d[i];   //S2
}
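Is this loop parallel? If not, how can it be made parallel?
S1 uses the value of b[i] that S2 assigned in the previous iteration,
so there is a loop-carried dependence from S2 to S1.
Despite this dependence, the loop can be made parallel because the
dependence is not circular.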
LOOP CARRIED DEPENDENCY
When a statement in one iteration of a loop
depends in some way on a statement in a
different iteration of the same loop, a
loop-carried dependence exists.
If a statement in one iteration of a loop
depends only on a statement in the same
iteration of the loop, this creates a
loop-independent dependence.
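The two cases can be contrasted with a pair of small loops. This is an illustrative sketch only; the function names and the particular computations (a running sum, and a scale followed by a shift) are assumptions chosen to show each kind of dependence.

```c
/* Loop-carried dependence: iteration i reads s[i-1], which iteration i-1
   wrote, so the iterations of this running sum must execute in order. */
void running_sum(double *s, const double *v, int n)
{
    if (n <= 0)
        return;
    s[0] = v[0];
    for (int i = 1; i < n; i++)
        s[i] = s[i - 1] + v[i];
}

/* Loop-independent dependence: S2 reads t[i], which S1 wrote in the same
   iteration. No value crosses from one iteration to another, so different
   iterations could run in parallel. */
void scale_then_shift(double *t, double *out, const double *v, int n)
{
    for (int i = 0; i < n; i++) {
        t[i] = 2.0 * v[i];      /* S1 */
        out[i] = t[i] + 1.0;    /* S2 depends on S1, same iteration */
    }
}
```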
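CIRCULAR DEPENDENCY
A circular dependency exists when S1
depends on S2 and S2 depends on S1.
In Example 2, neither statement depends on itself,
and although S1 depends on S2, S2 does not
depend on S1, so the dependence is not circular.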
A loop is parallel unless there is a cycle in the
dependencies, since the absence of a cycle means that
the dependencies give a partial ordering on the
statements.
This allows us to replace the loop in Example 2 with the following
code sequence:
a[1] = a[1] + b[1];
for (i=1; i<=99; i=i+1) {
    b[i+1] = c[i] + d[i];
    a[i+1] = a[i+1] + b[i+1];
}
b[101] = c[100] + d[100];
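One way to convince yourself that the rewritten sequence is equivalent to Example 2 is to run both on the same inputs and compare the results. The sketch below wraps each version in a function; the function names and the use of `double` arrays are assumptions for illustration.

```c
/* Original loop from Example 2. The arrays are indexed 1..101, so each
   needs at least 102 elements. */
void example2_original(double *a, double *b, const double *c, const double *d)
{
    for (int i = 1; i <= 100; i = i + 1) {
        a[i] = a[i] + b[i];        /* S1 */
        b[i + 1] = c[i] + d[i];    /* S2 */
    }
}

/* Transformed version: the first S1 and the last S2 are peeled off, and the
   loop body is reordered so that b[i+1] is produced and consumed within the
   same iteration. The dependence is now loop-independent, so the iterations
   no longer depend on each other. */
void example2_transformed(double *a, double *b, const double *c, const double *d)
{
    a[1] = a[1] + b[1];
    for (int i = 1; i <= 99; i = i + 1) {
        b[i + 1] = c[i] + d[i];
        a[i + 1] = a[i + 1] + b[i + 1];
    }
    b[101] = c[100] + d[100];
}
```

Given identical inputs, both functions should leave `a[1..100]` and `b[2..101]` with the same values.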
Example 3:
for (i=1; i<=100; i=i+1) {
    a[i+1] = a[i] + c[i];      //S1
    b[i+1] = b[i] + a[i+1];    //S2
}
This loop is not parallel because it has cycles in the dependencies:
S1 and S2 each depend on themselves. S1 uses the a[i] that S1 computed
in the previous iteration, and S2 uses the b[i] that S2 computed in the
previous iteration.
There are a number of techniques for converting loop-level
parallelism into instruction-level parallelism. Basically, such
techniques work by unrolling the loop.
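As a minimal sketch of unrolling, the fragment below applies it to the parallel loop of Example 1 rather than to Example 3. The unroll factor of 4 and the function name are assumptions for illustration. Because the four statements in the unrolled body do not depend on each other, a compiler or processor is free to overlap them.

```c
/* Example 1 unrolled by a factor of 4 (an illustrative choice; 1000 is a
   multiple of 4, so no clean-up loop is needed). Arrays are indexed
   1..1000 and need at least 1001 elements. */
void add_unrolled4(double *x, const double *y)
{
    for (int i = 1; i <= 1000; i = i + 4) {
        x[i]     = x[i]     + y[i];
        x[i + 1] = x[i + 1] + y[i + 1];
        x[i + 2] = x[i + 2] + y[i + 2];
        x[i + 3] = x[i + 3] + y[i + 3];
    }
}
```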
DEPENDENCIES IN CODE
There are several types of dependencies: true dependence,
anti-dependence, output dependence and input dependence.
In order to preserve the sequential behaviour of a loop when run in
parallel, true dependences must be preserved. Anti-dependences and
output dependences can be dealt with by giving each process its own
copy of the variables.
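The idea of giving each process its own copy can be sketched with a temporary variable. In the first function below, all iterations reuse one `t`, which creates anti- and output dependences between iterations even though no iteration needs another's value. In the second, each iteration has a private `t`. The function names and the computation are assumptions for illustration.

```c
/* Shared temporary: every iteration writes and then reads the same t.
   Across iterations this gives output dependences (write after write) and
   anti-dependences (write after read) on t. */
void square_plus_one_shared(double *y, const double *x, int n)
{
    double t;                      /* one copy shared by all iterations */
    for (int i = 0; i < n; i++) {
        t = x[i] * x[i];
        y[i] = t + 1.0;
    }
}

/* Privatized temporary: each iteration (or each process) has its own t, so
   only the true dependence inside a single iteration remains. */
void square_plus_one_private(double *y, const double *x, int n)
{
    for (int i = 0; i < n; i++) {
        double t = x[i] * x[i];    /* private copy per iteration */
        y[i] = t + 1.0;
    }
}
```

Both functions compute the same result; only the second has iterations that are independent of one another.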
EXAMPLES
Example of true dependence
S1: int a, b;
S2: a = 2;
S3: b = a + 40;
S2 ->T S3, meaning that S3 has a true
dependence on S2 because S2 writes to the
variable a, which S3 then reads.
EXAMPLES
Example of anti-dependence
S1: int a, b = 40;
S2: a = b - 38;
S3: b = -1;
S2 ->A S3, meaning that S3 has an
anti-dependence on S2 because S2 reads from
the variable b before S3 writes to it.
EXAMPLES
Example of output-dependence
S1: int a, b = 40;
S2: a = b - 38;
S3: a = 2;
S2 ->O S3, meaning that S3 has an output
dependence on S2 because both write to the
variable a.
EXAMPLES
Example of input-dependence
S1: int a, b, c = 2;
S2: a = c - 1;
S3: b = c + 1;
S2 ->I S3, meaning that S2 and S3 have an
input dependence because both read from
the variable c.
CHAINING, CONVOYS AND
CHIMES
Chaining allows the results of one vector operation to be
directly used as input to another vector operation.
A convoy is a set of vector instructions that can potentially
execute together. Only structural hazards cause separate
convoys, as true dependences are handled via chaining in
the same convoy.
A chime is the unit of time taken to execute one convoy:
about one clock cycle per vector element, with the startup cost on top.
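As a rough worked estimate that ignores the startup cost, a sequence of m convoys operating on vectors of n elements takes about m chimes. The symbols m, n and T_exec are introduced here for illustration, and n = 64 is an assumed vector length.

```latex
T_{\text{exec}} \approx m \times n \ \text{clock cycles}, \qquad
m = 3,\; n = 64 \;\Rightarrow\; T_{\text{exec}} \approx 3 \times 64 = 192 \ \text{clock cycles}
```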
The following VMIPS code executes in three chimes since there are
three convoys.
/* VMIPS code */            /* convoys */
LV      V1,Rx               1. LV      V1,Rx
MULVS.D V2,V1,F0               MULVS.D V2,V1,F0
LV      V3,Ry               2. LV      V3,Ry
ADDVV.D V4,V2,V3               ADDVV.D V4,V2,V3