SIMD
Single instruction, multiple data (SIMD)
Contents
Parallel Processors
Flynn's taxonomy
What is SIMD?
Types of Processing
Scalar Processing
Vector Processing
Architecture for Vector Processing
Vector processors
Vector Processor Architectures
Components of Vector Processors
Advantages of Vector Processing
Array processors
Array Processor Classification
Array Processor Architecture
Dedicated Memory Organization
Global Memory Organization
ILLIAC IV
ILLIAC IV Architecture
Super Computers
Cray X1
Multimedia Extension
Parallel Processors
In computers, parallel processing is the processing
of program instructions by dividing them among
multiple processors with the objective of running a program in
less time.
In the earliest computers, only one program ran at a time. A
computation-intensive program that took one hour to run and a
tape copying program that took one hour to run would take a
total of two hours to run. An early form of parallel processing
allowed the interleaved execution of both programs together.
The computer would start an I/O operation, and while it was
waiting for the operation to complete, it would execute the
processor-intensive program. The total execution time for the
two jobs would be a little over one hour.
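The arithmetic in this example can be sketched with a toy timing model. The function names and the 5-minute switching overhead are illustrative assumptions, not figures from the text; the model also assumes the two jobs never compete for the same resource.

```python
def sequential_minutes(job_minutes):
    """Total time when jobs run strictly one after another."""
    return sum(job_minutes)


def overlapped_minutes(cpu_minutes, io_minutes, overhead_minutes=0):
    """Total time when a CPU-bound job runs while an I/O-bound job waits.

    Idealized: the longer job dominates. overhead_minutes is a
    hypothetical allowance for switching between the two jobs.
    """
    return max(cpu_minutes, io_minutes) + overhead_minutes


print(sequential_minutes([60, 60]))    # 120 minutes, i.e. two hours
print(overlapped_minutes(60, 60, 5))   # 65 minutes, a little over one hour
```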
Flynn's taxonomy
Flynn's taxonomy is a classification of
computer architectures proposed by
Michael J. Flynn in 1966. The
classification has stuck and is still
used as a tool in the design of
modern processors and their
functionality.
Classification
The four classifications defined by Flynn are based
upon the number of concurrent instruction (or control)
streams and data streams available in the architecture.
Single instruction stream, single data stream (SISD)
Single instruction stream, multiple data streams
(SIMD)
Multiple instruction streams, single data stream
(MISD)
Multiple instruction streams, multiple data streams
(MIMD)
Single instruction, multiple threads (SIMT) is a later
term, usually treated as a variant of SIMD rather than
one of Flynn's four classes.
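Because the classes depend only on whether there is one or more than one instruction stream and data stream, the naming scheme can be sketched as a small lookup. The function name `flynn_class` is hypothetical, chosen only for illustration.

```python
def flynn_class(instruction_streams, data_streams):
    """Return the Flynn class name for the given numbers of concurrent streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    instr = "S" if instruction_streams == 1 else "M"
    data = "S" if data_streams == 1 else "M"
    return f"{instr}I{data}D"


print(flynn_class(1, 1))   # SISD: a conventional scalar CPU
print(flynn_class(1, 8))   # SIMD: one instruction applied to eight data streams
print(flynn_class(4, 4))   # MIMD: independent instruction and data streams
```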
Evolution of Intel Vector Instructions
■ MMX (1996, Pentium)
CPU-based MPEG decoding
Integers only; 64-bit registers divided into 2 x 32, 4 x 16, or 8 x 8 bits
Phased out with SSE4
■ SSE (1999, Pentium III)
CPU-based 3D graphics
4-way float operations, single precision
8 new 128-bit registers, 70 new instructions
■ SSE2 (2000, Pentium 4)
High-performance computing
Adds 2-way float ops, double-precision; same registers as 4-way single-precision
Integer SSE instructions make MMX obsolete
■ SSE3 (2004, Pentium 4E Prescott)
Scientific computing
New 2-way and 4-way vector instructions for complex
arithmetic
■ SSSE3 (2006, Core 2 Duo)
Minor advancement over SSE3
■ SSE4 (2007, Core 2 Duo Penryn)
Modern codecs, cryptography
New integer instructions
Better support for unaligned data, super shuffle engine
What is SIMD?
Single instruction, multiple data (SIMD), is a class
of parallel computers in Flynn's taxonomy.
It describes computers with multiple processing
elements that perform the same operation on multiple
data points simultaneously. Thus, such machines
exploit data level parallelism.
There are simultaneous (parallel) computations, but
only a single process (instruction) at a given moment.
How SIMD Works
Types of Processing
Scalar Processing
A scalar processor is a CPU that performs computations
on one data item at a time. It is classified as a
"single instruction stream, single data stream" (SISD)
CPU.
Vector Processing
A vector processor or array processor is a
central processing unit (CPU) that implements an
instruction set containing instructions that operate on
1-D arrays of data called vectors.
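The difference between the two styles can be sketched by counting how many add instructions each would issue. This is a model of the instruction stream only: Python still loops internally, and the function names and the `instructions_issued` counter are illustrative assumptions.

```python
def scalar_add(a, b):
    """Scalar (SISD) style: one add instruction issued per pair of elements."""
    result = []
    instructions_issued = 0
    for x, y in zip(a, b):
        result.append(x + y)
        instructions_issued += 1
    return result, instructions_issued


def vector_add(a, b):
    """Vector style: a single instruction names the whole 1-D array."""
    result = [x + y for x, y in zip(a, b)]
    instructions_issued = 1
    return result, instructions_issued


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
print(scalar_add(a, b))   # ([11, 22, 33, 44], 4)
print(vector_add(a, b))   # ([11, 22, 33, 44], 1)
```

Both versions produce the same values; only the number of instructions that must be fetched and decoded differs.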
Architecture for Vector Processing
Two architectures suitable for vector processing are:
Pipelined vector processors
Parallel Array processors
Pipelined vector processors
CPU that implements an instruction set that
operates on 1-D arrays, called vectors
Vectors contain multiple data elements
Number of data elements per vector is typically
referred to as the vector length
Both instructions and data are pipelined to reduce
decoding time
Advantages of Vector Processing
Advantages:
Quick fetch and decode of a single instruction for multiple
operations.
The instruction provides a regular source of data, which
arrives at each cycle and can be processed efficiently in a
pipelined fashion.
Easier Addressing of Main Memory
Elimination of Memory Wastage
Simplification of Control Hazards
Reduced Code Size
Array Processors
An array processor is a processor that performs
computations on a large array of data.
An array processor is a synchronous parallel
computer with multiple ALUs, called processing
elements (PEs), that operate in parallel in
lockstep fashion.
It is composed of N identical PEs under the control
of a single control unit, together with a number of
memory modules.
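The lockstep organization described above can be sketched as a single control unit broadcasting each instruction to N identical PEs. The class names, the three-opcode instruction set, and the per-PE dictionary standing in for a memory module are all hypothetical, chosen only to illustrate the idea.

```python
class ProcessingElement:
    """One PE: an ALU plus its own small local memory."""

    def __init__(self, local_value):
        self.memory = {"x": local_value, "acc": 0}

    def execute(self, opcode, operand):
        if opcode == "LOAD":      # acc <- memory[operand]
            self.memory["acc"] = self.memory[operand]
        elif opcode == "ADDI":    # acc <- acc + immediate
            self.memory["acc"] += operand
        elif opcode == "MULI":    # acc <- acc * immediate
            self.memory["acc"] *= operand
        else:
            raise ValueError(f"unknown opcode: {opcode}")


class ControlUnit:
    """Single control unit driving N identical PEs in lockstep."""

    def __init__(self, data):
        self.pes = [ProcessingElement(v) for v in data]

    def broadcast(self, opcode, operand):
        # Conceptually simultaneous: every PE executes the same instruction.
        for pe in self.pes:
            pe.execute(opcode, operand)

    def run(self, program):
        for opcode, operand in program:
            self.broadcast(opcode, operand)
        return [pe.memory["acc"] for pe in self.pes]


cu = ControlUnit([1, 2, 3, 4])
print(cu.run([("LOAD", "x"), ("MULI", 2), ("ADDI", 1)]))   # [3, 5, 7, 9]
```

Each PE computes 2x + 1 on its own data item, but only three instructions are issued for the whole array.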
Array Processor Classification
SIMD (Single Instruction, Multiple Data) array processor
is an array processor that has a single-instruction,
multiple-data organization.
It executes vector instructions by means of multiple
functional units responding to a common instruction.
Attached array processor
is an auxiliary processor attached to a general-purpose
computer.
Its purpose is to improve the performance of the host
computer in specific numerical calculation tasks.
SIMD-Array Processor Architecture
SIMD array processors have two basic configurations:
Array processors using random-access memory, also known as
the dedicated memory organization
• ILLIAC-IV, CM-2, MP-1
Associative processors using content-addressable
memory, also known as
the global memory organization
• BSP
MMX
Multimedia Extensions
Development
MMX (Multimedia Extension) was introduced in
1996 (Pentium with MMX and Pentium II).
SSE (Streaming SIMD Extensions) was introduced
with the Pentium III.
SSE2 was introduced with the Pentium 4.
SSE3 was introduced with the Pentium 4 supporting
Hyper-Threading Technology. SSE3 adds 13 more
instructions.
MMX
After analyzing many existing applications such as
graphics, MPEG, music, speech recognition, games, and
image processing, Intel found that many multimedia
algorithms execute the same instructions on many
pieces of data in a large data set.
Typical elements are small: 8 bits for pixels, 16 bits
for audio, 32 bits for graphics and general computing.
New data type: 64-bit packed data type. Why 64 bits?
Good enough
Practical
Data Types of MMX
The four MMX technology data types are:
Packed byte -- Eight bytes packed into one 64-bit
quantity.
Packed word -- Four 16-bit words packed into one
64-bit quantity.
Packed doubleword -- Two 32-bit doublewords
packed into one 64-bit quantity.
Quadword -- One 64-bit quantity.
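The packed types are different views of the same 64 bits. A minimal sketch, assuming unsigned elements and placing the first element in the lowest-order bits; the helper names `pack` and `unpack` are hypothetical and do not correspond to MMX instructions.

```python
def pack(values, element_bits):
    """Pack unsigned integers into one 64-bit quantity, first value lowest."""
    count = 64 // element_bits
    if len(values) != count:
        raise ValueError(f"expected {count} values of {element_bits} bits")
    mask = (1 << element_bits) - 1
    packed = 0
    for i, v in enumerate(values):
        if not 0 <= v <= mask:
            raise ValueError(f"{v} does not fit in {element_bits} bits")
        packed |= v << (i * element_bits)
    return packed


def unpack(packed, element_bits):
    """Split a 64-bit quantity into unsigned elements of the given width."""
    mask = (1 << element_bits) - 1
    count = 64 // element_bits
    return [(packed >> (i * element_bits)) & mask for i in range(count)]


q = pack([1, 2, 3, 4, 5, 6, 7, 8], 8)   # packed byte: eight 8-bit values
print(f"{q:016x}")                      # 0807060504030201
print(unpack(q, 16))                    # same bits as four words: [513, 1027, 1541, 2055]
```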
Compatibility
To be fully compatible with the existing Intel architecture (IA),
no new mode or state was created. Hence, for context switching, no
extra state needs to be saved.
To reach this goal, the MMX registers are aliased onto the FPU
registers. When the floating-point state is saved or restored, the
MMX state is saved or restored with it.
This allows an existing OS to context-switch processes that
execute MMX instructions without being aware of MMX.
However, it means MMX and the FPU cannot be used at
the same time, and switching between them carries a large overhead.
Intel defended its decision to alias MMX onto the FPU on
compatibility grounds, but it was arguably a bad decision: an OS
could simply have provided a service pack or been updated.
This is why Intel later introduced SSE without any aliasing.
Saturation Arithmetic
In an 8-bit grayscale picture, 255 is the value for pure white, and 0 is the
value for pure black. In a regular 8-bit register (AL, BL, CL ...), if we add one
to white, we get black! This is because regular arithmetic "rolls over"
(wraps around) when the result no longer fits. MMX gets around this with
instructions that use a technique called "saturation arithmetic". In
saturation arithmetic, the result never rolls over; it is clamped to the
largest or smallest representable value. For unsigned 8-bit values, this
gives the following equations:
255 + 100 = 255
200 + 100 = 255
0 - 100 = 0
99 - 100 = 0
This may seem counter-intuitive at first to people who are used to their
registers rolling over, but it makes sense in some situations: if we try to
make white brighter, it shouldn't become black.
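The contrast between roll-over and saturation can be sketched for unsigned 8-bit values as follows. The function names are illustrative; this models the arithmetic only, not any particular instruction.

```python
def wrap_add_u8(a, b):
    """Ordinary 8-bit addition: the result rolls over past 255."""
    return (a + b) & 0xFF


def sat_add_u8(a, b):
    """Saturating 8-bit addition: the result is clamped at 255."""
    return min(a + b, 255)


def sat_sub_u8(a, b):
    """Saturating 8-bit subtraction: the result is clamped at 0."""
    return max(a - b, 0)


print(wrap_add_u8(255, 1))    # 0   (white becomes black)
print(sat_add_u8(255, 100))   # 255
print(sat_add_u8(200, 100))   # 255
print(sat_sub_u8(0, 100))     # 0
print(sat_sub_u8(99, 100))    # 0
```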
MMX Registers
MMX defines eight registers, called MM0
through MM7, and operations that operate
on them. Each register is 64 bits wide and
can be used to hold either 64-bit integers, or
multiple smaller integers in a "packed"
format: a single instruction can then be
applied to two 32-bit integers, four 16-bit
integers, or eight 8-bit integers at once.
Instructions
The MMX registers are 64 bits wide, but can be broken down as
follows:
2 32-bit values
4 16-bit values
8 8-bit values
The MMX registers cannot easily be used for 64-bit arithmetic.
Let's say that we have 4 bytes loaded in an MMX register: 10, 25, 128,
255. We have them arranged as such:
MM0: | 10 | 25 | 128 | 255 |
And we do the following pseudocode operation, which adds 10 to every
byte with saturation:
MM0 + 10
We would get the following result:
MM0: | 10+10 | 25+10 | 128+10 | 255+10 | = | 20 | 35 | 138 | 255 |
Remember that our arithmetic "saturates" in the last box, so the value
doesn't go over 255.
Using MMX, we are essentially performing 4 additions in the time it
takes to perform 1 addition using the regular registers, using a quarter
as many instructions.
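The "MM0 + 10" operation can be emulated on a 64-bit integer by treating it as eight independent unsigned byte lanes. This is a sketch under assumptions: the helper names are hypothetical, the four unused lanes are padded with zeros, and the behavior is modeled on the MMX packed unsigned saturating byte add (PADDUSB) rather than taken from the text.

```python
def bytes_to_u64(values):
    """Place eight byte values into one 64-bit integer, first value lowest."""
    return int.from_bytes(bytes(values), "little")


def u64_to_bytes(value):
    """Split a 64-bit integer back into its eight byte lanes."""
    return list(value.to_bytes(8, "little"))


def packed_add_saturate_u8(a, b):
    """Add two 64-bit values lane by lane, clamping each byte lane at 255."""
    result = 0
    for lane in range(8):
        shift = lane * 8
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        result |= min(x + y, 0xFF) << shift   # no carry leaks into the next lane
    return result


mm0 = bytes_to_u64([10, 25, 128, 255, 0, 0, 0, 0])
tens = bytes_to_u64([10] * 8)
print(u64_to_bytes(packed_add_saturate_u8(mm0, tens)))
# [20, 35, 138, 255, 10, 10, 10, 10]
```

The first four lanes reproduce the result in the example; the key point is that one call handles every lane and that overflow in one lane never carries into its neighbor.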
MMX Instructions