Parallel Processing
What is Parallel Processing?
– It is a technique that provides simultaneous data processing in order to increase the computational speed of a computer system.
– Parallelism can be achieved by multiple functional units that perform identical or different operations simultaneously.
– In computers, parallel processing is the processing of program instructions by dividing them among multiple processors, with the objective of running a program in less time.
– The simultaneous use of more than one CPU to execute a program. Ideally, parallel processing makes a program run faster because there are more engines (CPUs) running it. In practice, it is often difficult to divide a program in such a way that separate CPUs can execute different portions without interfering with each other; a sketch of such a division follows.
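As an illustration of dividing a program among processors, here is a minimal sketch in C using POSIX threads (a hypothetical example: the thread count, array size, and file name are arbitrary, and it assumes a POSIX system compiled with -pthread). Each thread sums its own slice of an array, and the operating system may schedule the two threads on different CPUs.

/* parallel_sum.c - a minimal sketch of dividing work among CPUs
   using POSIX threads (compile with -pthread). */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 2

static long data[N];

struct slice { int start, end; long sum; };

/* Each thread sums its own slice of the array independently. */
static void *partial_sum(void *arg)
{
    struct slice *s = arg;
    s->sum = 0;
    for (int i = s->start; i < s->end; i++)
        s->sum += data[i];
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct slice sl[NTHREADS];

    for (int i = 0; i < N; i++)
        data[i] = i;

    /* Divide the array into equal slices, one per thread/CPU. */
    for (int t = 0; t < NTHREADS; t++) {
        sl[t].start = t * (N / NTHREADS);
        sl[t].end = (t + 1) * (N / NTHREADS);
        pthread_create(&tid[t], NULL, partial_sum, &sl[t]);
    }

    long total = 0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += sl[t].sum;   /* combine the partial results */
    }
    printf("total = %ld\n", total);
    return 0;
}

The slices are independent, so the threads do not interfere with each other; the only serial step is combining the two partial sums at the end.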
A. Multiple Processor Organization
• Single instruction, single data stream – SISD
• Single instruction, multiple data stream – SIMD
• Multiple instruction, single data stream – MISD
• Multiple instruction, multiple data stream – MIMD
Single Instruction, Single Data Stream – SISD
• Single processor
• Single instruction stream
• Data stored in single memory
• Uni-processor
Single Instruction, Multiple Data Stream – SIMD
• A single machine instruction controls the simultaneous execution of a number of processing elements on a lockstep basis
• Each processing element has associated data memory
• Each instruction executed on different set of data by different processors
• Vector and array processors (a minimal SIMD sketch follows)
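One concrete (and hypothetical) illustration of the lockstep idea is the x86 SSE instruction set: a single instruction operates on four floating-point elements at once. The sketch below assumes an x86 processor with SSE and a GCC/Clang toolchain.

/* simd_add.c - single instruction, multiple data: each _mm_add_ps
   performs four additions in lockstep (x86 with SSE assumed). */
#include <xmmintrin.h>
#include <stdio.h>

int main(void)
{
    float a[4] = {1, 2, 3, 4};
    float b[4] = {10, 20, 30, 40};
    float c[4];

    __m128 va = _mm_loadu_ps(a);      /* load four elements */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* one instruction, four additions */
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++)
        printf("%.0f ", c[i]);        /* 11 22 33 44 */
    printf("\n");
    return 0;
}

An array or vector processor generalizes the same principle to many processing elements, each applying the one broadcast instruction to its own data memory.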
Multiple Instruction, Single Data Stream – MISD
• Sequence of data
• Transmitted to set of processors
• Each processor executes different instruction sequence
• Not commercially implemented
Multiple Instruction, Multiple Data Stream- MIMD
• Set of processors
• Simultaneously execute different instruction sequences
• Different sets of data
• SMPs, clusters and NUMA systems
Taxonomy of Parallel Processor Architecture
MIMD – Overview
• General purpose processors
• Each can process all instructions necessary
• Further classified by method of processor communication
Tightly Coupled – SMP
• Processors share memory
• Communicate via that shared memory
• Symmetric Multiprocessor (SMP)
• —Share single memory or pool
• —Shared bus to access memory
• —Memory access time to given area of memory is approximately the same for each
processor
Tightly Coupled – NUMA
• Non-uniform memory access
• Access times to different regions of memory may differ
Loosely Coupled – Clusters
• Collection of independent uniprocessors or SMPs
• Interconnected to form a cluster
• Communication via fixed path or network connections
Parallel Organizations – SISD
Parallel Organizations – SIMD
Parallel Organizations – MIMD Shared Memory
Parallel Organizations – MIMD Distributed Memory
B. Symmetric Multiprocessors
• A stand-alone computer with the following characteristics:
• —Two or more similar processors of comparable capacity
• —Processors share same memory and I/O
• —Processors are connected by a bus or other internal connection
• —Memory access time is approximately the same for each processor
• —All processors share access to I/O
• ○Either through same channels or different channels giving paths to same devices
• —All processors can perform the same functions (hence symmetric)
• —System controlled by integrated operating system
• ○providing interaction between processors
• ○Interaction at job, task, file and data element levels
Multiprogramming and Multiprocessing
SMP Advantages
• Performance
• —If some work can be done in parallel
• Availability
• —Since all processors can perform the same functions, failure of a single processor does not halt the system
• Incremental growth
• —User can enhance performance by adding additional processors
• Scaling
• —Vendors can offer range of products based on number of processors
Block Diagram of Tightly Coupled Multiprocessor
Organization Classification
• Time shared or common bus
• Multiport memory
• Central control unit
Time Shared Bus
• Simplest form
• Structure and interface similar to single processor system
• Following features provided
• —Addressing – distinguish modules on bus
• —Arbitration – any module can be temporary master
• —Time sharing – if one module has the bus, others must wait and may have to
suspend
• Now have multiple processors as well as multiple I/O modules contending for the bus (a minimal arbitration sketch follows)
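To make arbitration and time sharing concrete, the following toy sketch (hypothetical names and counts) models the bus as a mutex shared by several module threads: whichever module holds the lock is the temporary bus master, and the others must wait.

/* bus_model.c - toy model of a time-shared bus: only one "module"
   can be bus master at a time (POSIX threads assumed). */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t bus = PTHREAD_MUTEX_INITIALIZER;

static void *module(void *arg)
{
    int id = *(int *)arg;
    for (int xfer = 0; xfer < 3; xfer++) {
        pthread_mutex_lock(&bus);      /* arbitration: become bus master */
        printf("module %d owns the bus (transfer %d)\n", id, xfer);
        pthread_mutex_unlock(&bus);    /* release so another module can win */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[3];
    int id[3] = {0, 1, 2};
    for (int i = 0; i < 3; i++)
        pthread_create(&t[i], NULL, module, &id[i]);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    return 0;
}

A real bus arbiter is hardware, of course; the mutex only captures the one-master-at-a-time rule, not the addressing or priority details.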
Symmetric Multiprocessor Organization
Time Share Bus – Advantages
• Simplicity
• Flexibility
• Reliability
Time Share Bus – Disadvantages
• Performance limited by bus cycle time
• Each processor should have a local cache
• —Reduces number of bus accesses
• Leads to problems with cache coherence
• —Solved in hardware – see later
Operating System Issues
• Simultaneous concurrent processes
• Scheduling
• Synchronization
• Memory management
• Reliability and fault tolerance
A Mainframe SMP
IBM zSeries
• Ranges from a uniprocessor with one main memory card to a high-end system with 48 processors and 8 memory cards
• Dual-core processor chip
• —Each includes two identical central processors (CPs)
• —CISC superscalar microprocessor
• —Mostly hardwired, some vertical microcode
• —256-kB L1 instruction cache and a 256-kB L1 data cache
• L2 cache, 32 MB
• —Clusters of five
• —Each cluster supports eight processors and access to entire main memory space
• System control element (SCE)
• —Arbitrates system communication
• —Maintains cache coherence
• Main store control (MSC)
• —Interconnects L2 caches and main memory
• Memory card
• —Each 32 GB; maximum of 8, for a total of 256 GB
• —Interconnect to MSC via synchronous memory interfaces (SMIs)
• Memory bus adapter (MBA)
• —Interface to I/O channels; traffic goes directly to L2 cache
IBM z990 Multiprocessor Structure
C. Cache Coherence and MESI Protocol
• Problem – multiple copies of same data in different caches
• Can result in an inconsistent view of memory
• Write back policy can lead to inconsistency
• Write through can also give problems unless caches monitor memory traffic
Cache Coherence
• Node 1 directory keeps note that node 2 has copy of data
• If data modified in cache, this is broadcast to other nodes
• Local directories monitor and purge local cache if necessary
• Local directory monitors changes to local data in remote caches and marks memory invalid until writeback
• Local directory forces writeback if memory location requested by another processor
Software Solutions
• Compiler and operating system deal with problem
• Overhead transferred to compile time
• Design complexity transferred from hardware to software
• However, software tends to make conservative decisions
• —Inefficient cache utilization
• Analyze code to determine safe periods for caching shared variables
Hardware Solution
• Cache coherence protocols
• Dynamic recognition of potential problems
• Run time
• More efficient use of cache
• Transparent to programmer
• Directory protocols
• Snoopy protocols
Directory Protocols
• Collect and maintain information about copies of data in cache
• Directory stored in main memory
• Requests are checked against directory
• Appropriate transfers are performed
• Creates central bottleneck
• Effective in large scale systems with complex interconnection schemes
Snoopy Protocols
• Distribute cache coherence responsibility among cache controllers
• Cache recognizes that a line is shared
• Updates announced to other caches
• Suited to bus based multiprocessor
• Increases bus traffic
Write Invalidate
• Multiple readers, one writer
• When a write is required, all other caches of the line are invalidated
• Writing processor then has exclusive (cheap) access until line required by another processor
• Used in Pentium II and PowerPC systems
• State of every line is marked as modified, exclusive, shared or invalid; hence MESI (a small state-machine sketch follows)
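A cache controller can be viewed as a small state machine per line. The sketch below (a simplified, hypothetical model, not the full protocol) covers the main MESI transitions for one line: local reads and writes plus snooped remote reads and writes.

/* mesi_line.c - toy state machine for one cache line under the MESI
   write-invalidate protocol; events and transitions are simplified. */
#include <stdio.h>

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } event_t;

static const char *name[] = { "Modified", "Exclusive", "Shared", "Invalid" };

/* Next state of the line; other_copies says whether another cache
   already holds the line (needed for the Invalid-plus-read case). */
static mesi_t next(mesi_t s, event_t e, int other_copies)
{
    switch (e) {
    case LOCAL_READ:
        if (s == INVALID)               /* read miss: fetch the line */
            return other_copies ? SHARED : EXCLUSIVE;
        return s;                       /* read hit: state unchanged */
    case LOCAL_WRITE:
        return MODIFIED;                /* Shared/Invalid: invalidate other copies first */
    case REMOTE_READ:                   /* another cache reads the line */
        if (s == MODIFIED || s == EXCLUSIVE)
            return SHARED;              /* Modified also writes the line back */
        return s;
    case REMOTE_WRITE:                  /* another cache wants to write */
        return INVALID;                 /* our copy is invalidated */
    }
    return s;
}

int main(void)
{
    mesi_t s = INVALID;
    s = next(s, LOCAL_READ, 0);   printf("%s\n", name[s]);  /* Exclusive */
    s = next(s, LOCAL_WRITE, 0);  printf("%s\n", name[s]);  /* Modified  */
    s = next(s, REMOTE_READ, 1);  printf("%s\n", name[s]);  /* Shared    */
    s = next(s, REMOTE_WRITE, 1); printf("%s\n", name[s]);  /* Invalid   */
    return 0;
}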
Write Update
• Multiple readers and writers
• Updated word is distributed to all other processors
• Some systems use an adaptive mixture of both solutions
MESI State Transition Diagram
Increasing Performance
• Processor performance can be measured by the rate at which it executes instructions
• MIPS rate = f * IPC (a worked example follows this list)
• —f = processor clock frequency, in MHz
• —IPC = average instructions executed per cycle
• Increase performance by increasing the clock frequency and increasing the number of instructions that complete during a cycle
• May be reaching limit
• —Complexity
• —Power consumption
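For example, under this formula a hypothetical processor clocked at f = 400 MHz with an average IPC of 2 delivers 400 * 2 = 800 MIPS; doubling either the clock frequency or the average IPC doubles the MIPS rate.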
Multithreading and Chip Multiprocessors
• Instruction stream divided into smaller streams (threads)
• Executed in parallel
• Wide variety of multithreading designs
Definitions of Threads and Processes
• A thread in multithreaded processors may or may not be the same as a software thread
• Process:
• —An instance of a program running on a computer
• —Resource ownership
• ○Virtual address space to hold process image
• —Scheduling/execution
• —Process switch
• Thread: dispatchable unit of work within a process
• —Includes processor context (which includes the program counter and stack pointer) and data area for stack
• —Thread executes sequentially
• —Interruptible: processor can turn to another thread
• Thread switch
• —Switching processor between threads within same process
• —Typically less costly than a process switch (a sketch of threads sharing one process's address space follows)
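As a hypothetical sketch of the distinction, the C program below creates two threads inside one process: both see the same global counter because they share the process's virtual address space, while each thread has its own stack and program counter.

/* shared_counter.c - two threads within one process share its address
   space; a mutex protects the shared counter (POSIX threads assumed). */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                      /* shared: one copy per process */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {        /* i lives on this thread's own stack */
        pthread_mutex_lock(&lock);
        counter++;                            /* both threads update the same word */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);       /* 200000: one shared address space */
    return 0;
}

Switching between these two threads only requires saving and restoring processor context; switching between two separate processes would also require switching address spaces, which is why a thread switch is typically cheaper.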
Implicit and Explicit Multithreading
• All commercial processors and most experimental ones use explicit multithreading
• —Concurrently execute instructions from different explicit threads
• —Interleave instructions from different threads on shared pipelines or parallel execution on parallel pipelines
• Implicit multithreading is concurrent execution of multiple threads extracted from single sequential program
• —Implicit threads defined statically by compiler or dynamically by hardware
Approaches to Explicit Multithreading
• Interleaved
• —Fine-grained
• —Processor deals with two or more thread contexts at a time
• —Switching thread at each clock cycle
• —If a thread is blocked it is skipped
• Blocked
• —Coarse-grained
• —Thread executed until event causes delay
• —E.g. cache miss
• —Effective on in-order processor
• —Avoids pipeline stall
• Simultaneous (SMT)
• —Instructions simultaneously issued from multiple threads to execution units of superscalar processor
• Chip multiprocessing
• —Processor is replicated on a single chip
• —Each processor handles separate threads