High Performance Computing Unit 1-2

High Performance Computing: Units 1 & 2 (SPPU)

Unit I: Introduction to Parallel Computing (09 Hours)
Introduction to Parallel Computing: Motivating Parallelism, Modern Processor: Stored-program computer architecture, General-purpose Cache-based Microprocessor architecture. Parallel Programming Platforms: Implicit Parallelism, Dichotomy of Parallel Computing Platforms, Physical Organization of Parallel Platforms, Communication Costs in Parallel Machines. Levels of parallelism, Models: SIMD, MIMD, SIMT, SPMD, Data Flow Models, Demand-driven Computation, Architectures: N-wide superscalar architectures, multi-core, multi-threaded. (Refer Chapter 1)
#Exemplar/Case Studies: Case study: Multi-core System
*Mapping of Course Outcomes for Unit I: CO1

Unit II: Parallel Algorithm Design (09 Hours)
Principles of Parallel Algorithm Design: Preliminaries, Decomposition Techniques, Characteristics of Tasks and Interactions, Mapping Techniques for Load Balancing, Methods for Containing Interaction Overheads, Parallel Algorithm Models: Data, Task, Work Pool and Master-Slave Model, Complexities: Sequential and Parallel Computational Complexity, Anomalies in Parallel Algorithms. (Refer Chapter 2)
#Exemplar/Case Studies: IPoC: A New Core Networking Protocol for 5G Networks
*Mapping of Course Outcomes for Unit II: CO2

Unit III: Parallel Communication (09 Hours)
Basic Communication: One-to-All Broadcast, All-to-One Reduction, All-to-All Broadcast and Reduction, All-Reduce and Prefix-Sum Operations, Collective Communication using MPI: Scatter, Gather, Broadcast, Blocking and Non-blocking MPI, All-to-All Personalized Communication, Circular Shift, Improving the Speed of Some Communication Operations. (Refer Chapter 3)
#Exemplar/Case Studies: Case study: Monte-Carlo Pi computing using MPI
*Mapping of Course Outcomes for Unit III: CO3

Unit IV: Analytical Modeling of Parallel Programs (09 Hours)
Sources of Overhead in Parallel Programs, Performance Measures and Analysis: Amdahl's and Gustafson's Laws, Speedup Factor and Efficiency, Cost and Utilization, Execution Rate and Redundancy, The Effect of Granularity on Performance, Scalability of Parallel Systems, Minimum Execution Time and Minimum Cost-Optimal Execution Time, Asymptotic Analysis of Parallel Programs. Matrix Computation: Matrix-Vector Multiplication, Matrix-Matrix Multiplication. (Refer Chapter 4)
#Exemplar/Case Studies: Case study: The DAG Model of parallel computation
*Mapping of Course Outcomes for Unit IV: CO4

Chapter 1: Introduction to Parallel Computing
Syllabus: Introduction to Parallel Computing: Motivating Parallelism, Modern Processor: Stored-program computer architecture, General-purpose Cache-based Microprocessor architecture. Parallel Programming Platforms: Implicit Parallelism, Dichotomy of Parallel Computing Platforms, Physical Organization of Parallel Platforms, Communication Costs in Parallel Machines. Levels of parallelism, Models: SIMD, MIMD, SIMT, SPMD, Data Flow Models, Demand-driven Computation, Architectures: N-wide superscalar architectures, multi-core, multi-threaded.
1.1 Introduction to Parallelism
1.2 Motivation for Parallelism
1.2.1 Increase in Number of Transistors in ICs: Moore's Law
1.2.2 Memory / Disk Speed Improvement
1.2.3 Data Communication Improvement
1.3 Scope of Parallel Computing
1.3.1 Applications of Parallel Processing in Medical
1.3.2 Applications of Parallel Processing in Weather-Climate Research
1.3.3 Commercial Applications
1.3.4 Science and Engineering Applications
1.3.5 Scientific Applications
1.3.6 Applications in Computer Systems
1.4 Parallel Programming Platforms: Implicit Parallelism
1.4.1 Pipelining Execution
1.4.2 Superscalar Execution
1.4.3 Very Long Instruction Word Processors (VLIW)
1.4.3(A) VLIW Processor Structure
1.4.3(B) Advantages, Disadvantages and Applications
1.5 Trends in Microprocessors and Architectures
1.6 Limitations of Memory System Performance
1.6.1 Impact of Memory Bandwidth
1.6.2 Hiding Memory Latency Techniques
1.7 Dichotomy of Parallel Computing Platforms
1.7.1 Control Structure of Parallel Platforms
1.7.1(A) SIMD Architecture
1.7.1(B) MIMD Architecture
1.7.2 Communication Model of Parallel Platforms
1.7.2(A) Shared-Address-Space Platforms
1.7.2(B) Message-Passing Platforms
1.8 Physical Organization of Parallel Platforms and the Stored-Program Computer
1.8.1 Evolution / Levels of Parallelism
1.8.2 Ideal Model for Parallel Computing
1.9 Interconnection Networks for Parallel Computers
1.9.1 Static Networks
1.9.1(A) Linear and Ring Topologies
1.9.1(B) Meshes and Torus
1.9.1(C) Hypercubes
1.9.1(D) Trees
1.9.1(E) Fully-Connected Networks
1.9.1(F) Fat Tree
1.9.2 Dynamic Networks
1.9.2(A) Bus-Based Interconnect Networks
1.9.2(B) Switch-Based Interconnect Networks
1.10 Cache Coherence in Multi-Processor Systems
1.10.1 Snooping Solution (Snoopy Bus)
1.10.2 Directory-Based Protocol
Superscalar Processor Hardware
Pipelining in Superscalar Processors
Multi-core Architecture

Chapter 2: Parallel Algorithm Design
Syllabus: Principles of Parallel Algorithm Design: Preliminaries, Decomposition Techniques, Characteristics of Tasks and Interactions, Mapping Techniques for Load Balancing, Methods for Containing Interaction Overheads, Parallel Algorithm Models: Data, Task, Work Pool and Master-Slave Model, Complexities: Sequential and Parallel Computational Complexity, Anomalies in Parallel Algorithms.
2.1 Principles of Parallel Algorithm Design
2.1.1 Preliminaries
2.1.2 Decomposition, Tasks and Dependency Graphs
2.1.2(A) Decomposition
2.1.2(B) Tasks and Task Dependency Graph
2.2 Decomposition Techniques
2.2.1 Data Decomposition
2.2.2 Recursive Decomposition
2.2.3 Exploratory Decomposition
2.2.4 Speculative Decomposition
2.3 Characteristics of Tasks and Interactions
2.4 Mapping Techniques for Load Balancing
2.5 Methods for Containing Interaction Overheads
2.5.1 Maximizing Data Locality
2.5.2 Minimizing Contention and Hot-Spots
2.5.3 Overlapping Computation with Interaction
2.5.4 Replicating Data or Computation
2.5.5 Using Optimized Collective Interaction Operations
2.6 Parallel Algorithm Models
2.6.1 Data Parallel Model
2.6.2 Task Graph Model
2.6.3 The Task Pool Model
2.6.4 Master-Slave Model
2.6.5 Pipeline / Producer-Consumer Problem
2.6.6 Hybrid Model
2.7 Complexities
2.7.1 Sequential and Parallel Computational Complexity
2.7.2 Anomalies in Parallel Algorithms

Chapter 3: Parallel Communication
Syllabus: Basic Communication: One-to-All Broadcast, All-to-One Reduction, All-to-All Broadcast and Reduction, All-Reduce and Prefix-Sum Operations, Collective Communication using MPI: Scatter, Gather, Broadcast, Blocking and Non-blocking MPI, All-to-All Personalized Communication, Circular Shift, Improving the Speed of Some Communication Operations.
3.1 One-to-All Broadcast and All-to-One Reduction
3.1.1 One-to-All Broadcast
3.1.2 All-to-One Reduction
3.1.3 One-to-All Broadcast and All-to-One Reduction on Rings or Linear Arrays
3.1.4 Example of One-to-All Broadcast using Recursive Doubling on a Ring
3.1.5 Example of All-to-One Reduction using Recursive Doubling on a Ring
3.1.6 Cost Analysis of One-to-All Communication
3.2 All-to-All Broadcast and Reduction
3.2.1 All-to-All Broadcast
3.2.2 All-to-All Reduction
3.2.3 Example of All-to-All Broadcast Operation on a Ring
3.2.4 Cost Analysis for All-to-All Operation on a Ring

Chapter 1: Introduction to Parallel Computing

Syllabus: Introduction to Parallel Computing: Motivating Parallelism, Modern Processor: Stored-program computer architecture, General-purpose Cache-based Microprocessor architecture. Parallel Programming Platforms: Implicit Parallelism, Dichotomy of Parallel Computing Platforms, Physical Organization of Parallel Platforms, Communication Costs in Parallel Machines. Levels of parallelism, Models: SIMD, MIMD, SIMT, SPMD, Data Flow Models, Demand-driven Computation, Architectures: N-wide superscalar architectures, multi-core, multi-threaded.

1.1 Introduction to Parallelism

- Parallel computing is a form of computation in which many calculations are carried out at the same time. It is based on the principle that a computational problem can be divided into smaller subproblems which can then be solved simultaneously, and it assumes the existence of parallel hardware capable of carrying out these computations at the same time.
- The primary reasons for using parallel computing are the concurrency it provides, the time it saves and the larger problems it makes solvable.

Fig. 1.1.1: Advantages of Parallel Computing over Serial Computing

- Solving very large problems by serial computing is impractical.
- When the local resources are finite, parallel computing can take advantage of non-local resources.
- Serial computing wastes potential computing power, whereas parallel computing makes better utilization of the hardware and computing resources.
- Parallel computing saves time and money, since many resources working together reduce both.

1.2 Motivation for Parallelism

Q. How to improve the speed of communication operations?

A short glance at the history of parallel computing helps in understanding its origin and its recent rise in popularity. Parallel execution has been present ever since the early days of computing, when computers were still mechanical devices, and the role of parallelism in accelerating computing speeds has been recognized for decades.

1.2.1 Increase in Number of Transistors in ICs: Moore's Law

Computational power has been enhanced by increasing the number of transistors in integrated circuits, as described by Moore's Law (1965):

"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000."

- This statement, made by Moore in 1965, predicted that over the next ten years computers would become very fast. Those ten years elapsed, and it turned out that computers did continue to become faster; computing speed went on to double roughly every 18 months, achieved through a variety of techniques.
- In 1975, Moore predicted that the easy gains in ingenuity had essentially been exhausted, and that further innovation in making devices better and faster would be much harder to come by.
- Still, simply through advances in technology, by being able to lay thinner wires on silicon, more and more transistors can be packed in, which in turn makes computers faster. There were two parts to this: one was that the clocks could be driven much faster, driving the gates with a faster clock, and this part has saturated in the last few years.
- In a modern processor, the power of the CPU is increased by adding additional transistors to the integrated circuit or chip.

1.2.2 Memory / Disk Speed Improvement

- The overall speed of a computation depends not only on the speed of the processor; other factors are also involved.
- Memory and disk speed also play an important role in improving computation speed. Parallel platforms provide increased bandwidth to the memory system.
- Using the principle of locality of reference, which is used to manage the mismatch between memory and processor speed, and using bulk access, memory optimization techniques can be applied in parallel algorithm design.
- Parallel platforms also provide a higher aggregate cache memory.
- In parallel computing, many application areas rely on the ability to pump data to memory and disk faster, not only on raw computational rates.

1.2.3 Data Communication Improvement

- As the networking infrastructure evolves, the vision of the Internet as one large computing platform has emerged.
- In many applications, such as databases and data mining, the volume of data is such that it cannot be moved.
- Any analysis of this data must be performed over the network using parallel techniques.
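The locality-of-reference and bandwidth effects mentioned in section 1.2.2 can be observed directly on most cache-based machines. The following program is an illustrative sketch, not part of the syllabus text: it sums the same matrix once along rows (consecutive addresses, cache-friendly) and once along columns (strided addresses, cache-unfriendly). The matrix size N and the use of clock() are arbitrary choices made for the demonstration; actual timings depend on the processor and its cache hierarchy.

```c
#include <stdio.h>
#include <time.h>

#define N 2048                      /* illustrative matrix dimension */

static double a[N][N];

/* Sum the matrix row by row: consecutive memory accesses (good locality). */
static double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Sum the matrix column by column: stride-N accesses (poor locality). */
static double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;

    clock_t t0 = clock();
    double s1 = sum_row_major();
    clock_t t1 = clock();
    double s2 = sum_col_major();
    clock_t t2 = clock();

    printf("row-major sum = %.0f in %.3f s\n", s1, (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("col-major sum = %.0f in %.3f s\n", s2, (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}
```

On a typical cache-based microprocessor the column-wise loop runs noticeably slower even though it performs exactly the same arithmetic; this is the gap that the locality and bulk-access techniques mentioned above try to exploit.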
1.3 Scope of Parallel Computing

Q. What are the applications of parallel computing?
Q. Describe the scope of parallel computing. Give applications of parallel computing.

- There is always a need to increase the speed of a computing system, and the computation demanded from a system keeps increasing. Complicated operations in research, medicine, weather forecasting, artificial intelligence, automation, defence, aerodynamics, biology, consulting, databases, electronics, energy, environment modelling, finance, geophysics, information services, life sciences, telecommunication, transportation, weather-climate research, computer vision and so on all require powerful computation, leading to the necessity of high performance.
- Many things have been done to make systems faster. In the beginning, processor designers introduced a prefetch unit (in the 8086 processor). It was then found that, although the instructions were prefetched, the processor did not perform as fast as expected. Researchers came up with new concepts to increase the speed: cache memory, internal caches, pipelined processors, superscalar processors, parallel computing and so on. These evolved one after the other to increase the speed of the system.
- Parallelism finds applications in very diverse domains:
1. Scientific applications
2. Commercial applications
3. Applications in engineering and design
4. Applications in computer systems
5. Weather-climate research
6. Applications of parallel processing in medicine

1.3.1 Applications of Parallel Processing in Medical

- Medicine requires very fast computation, which can be provided by parallelism on a large scale. Modern computing has made many things viable in the medical industry.
- In particular, many medical applications that involve imaging, and processing of those images, require parallel computing.
- Magnetic Resonance Imaging (MRI), for example, which is used to see inside the human body, requires very fast as well as very large computations. The computations that build the 3D image of the inside of the body must also be very accurate.
- MRI is used to find bleeding, tumours, injuries, blood-vessel diseases, infections and so on. It uses a magnetic field and radio-frequency pulses to produce pictures of the organs inside the body. While scanning the body, the image capture, computation and reconstruction must be fast enough that the patient does not have to remain in the radio-frequency field for long.
- Various parallel mechanisms such as massively parallel processing (MPP) arrays and multicomputer processing are normally used for such workloads.

1.3.2 Applications of Parallel Processing in Weather-Climate Research

- Weather forecasting is very important for many activities such as agriculture, fishing and so on. The weather-climate forecast requires a large number of parameters, including:
1. Humidity
2. Wind direction and speed
3. Daily maximum and minimum air temperatures
4. Total solar radiation
5. Carbon dioxide concentration
6. Ocean sea ice
7. Cloud physics parameterization, and many more.
- All these parameters have to be worked on simultaneously. Climate models are said to be computationally intensive because there is a great deal to calculate. The prediction is based on statistics of the previous century, and the scale demands precision down to the minute. The huge (almost spherical) earth has to be divided into discrete units of latitude and longitude, forming cells of non-uniform size as shown in Fig. 1.3.1.

Fig. 1.3.1: Non-uniform cells of the earth

- These calculations use spherical coordinates (θ, φ), which are uniform, and require spectral transforms or FFTs. Computing the parameters for such a huge grid would require enormous time on a non-parallel system.
- Another option is to divide the surface into parts forming two-dimensional planes and work on these smaller parts in parallel, which requires a special topological arrangement of processors. The algorithms and computations must have very high accuracy, vector processing, fast communication, huge cache memory, petaflop (a unit of speed) computing, large databases and so on.

1.3.3 Commercial Applications

- Current market trends require many activities to be performed simultaneously, which requires more processing; this is the commercial side of parallel computing. Commercial applications also require the processing of large amounts of data.
- For example:
1. Big data, data mining, databases
2. Artificial intelligence
3. Financial and economic modelling
4. Web search engines and web-based business services

1.3.4 Science and Engineering Applications

Parallel computing is used to model difficult problems in many areas of science and engineering, such as:
- Bioscience, genetics and biotechnology
- Chemistry, molecular science
- Mechanical engineering, spacecraft
- Electrical engineering, circuit design and microelectronics
- Geology, seismology, and many more

1.3.5 Scientific Applications

- Scientific applications are created to show the simulated behaviour of real-world entities using mathematics and mathematical formulas.
- This means that objects existing in the real world are mapped into mathematical models, and the actions of those objects are simulated by evaluating the formulas.
- Simulations are based on very high-end calculations and require the use of powerful computers. Most of the powerful computers are parallel computers and perform these computations in parallel.
- Scientific applications are therefore major candidates for parallel computation. Weather forecasting and climate modelling, oil exploration and energy research, drug discovery and genomic research are some of the scientific applications in which parallel activities are carried out throughout.

1.3.6 Applications in Computer Systems

- Collections of low-power computing devices in the form of clusters make it possible to use the computing power of multiple devices to solve large computing tasks efficiently. A group of computers configured to solve a problem by means of parallel processing is termed a cluster computing system.

1.4 Parallel Programming Platforms: Implicit Parallelism

1.4.1 Pipelining Execution

- A processor has many resources such as the ALU, buses and registers. An attempt to utilize all these resources continuously, or to their fullest, can be made through pipelining. In a pipelined system the instructions flow through the processor as if the processor were a pipe.
- Instructions move from one stage to another to accomplish the assigned operation. Hence, most of the time each unit of the processor is busy handling one instruction or another, keeping the processor's resources in continuous use.
- In a non-pipelined system, the processor fetches an instruction from memory, decodes it to determine what the instruction is, reads the instruction's inputs from the register file, performs the computation required by the instruction and writes the result back into the register file. This is also called the unpipelined approach.
- The problem with this approach is that the hardware needed for each of these steps (instruction fetch, instruction decode, register read, instruction execution and write-back) is different, and most of this hardware is idle at any given moment, waiting while some other part of the processor executes the instruction.
- Pipelining is a technique for overlapping the execution of several instructions to reduce the execution time of a sequence of instructions.
- Two-stage pipelining consists of two stages, fetch instruction and execute instruction. These two operations are overlapped: while one instruction is being executed, the next one is fetched, and when that instruction is executed, the next one is fetched, and so on.
- Pipelining is thus the process of fetching the next instruction while the current instruction is being executed.
- This method of executing instructions in a pipeline speeds up the processor. It also ensures that all the units of the processor are kept busy and none of them is starving. With the help of pipelining, the operating speed of the processor increases: the more pipeline stages, the faster the processor, but the more complex the design.
- A simple two-stage instruction pipeline is shown in Fig. 1.4.1.

Fig. 1.4.1: Two-stage pipelining

- Pipelining is implemented in almost all modern processors.

1.4.2 Superscalar Execution

- Early processors were scalar processors. In a scalar organization, a single pipelined functional unit (the part of the CPU responsible for calculations) exists for integer operations and one for floating-point operations.
- A single pipeline is not sufficient to achieve parallelism; pipelines allow performance increases through parallelism, which is achieved by enabling multiple instructions to be at different stages of the pipeline at once. This concept is implemented in superscalar processors.
- A superscalar processor is designed to execute more than one instruction at a time during a single clock cycle. Common instructions such as arithmetic, load and store can be initiated and executed independently. The superscalar processor fetches and decodes several instructions at a time.
- The superscalar architecture exploits the potential of Instruction-Level Parallelism (ILP).
- In Fig. 1.4.2, there are multiple pipelines for different instructions, such as integer and floating-point operations, fed from the integer and floating-point register files.

Fig. 1.4.2: Superscalar organization

- The complexity and hardware cost of the superscalar scheduler is a major consideration in processor design. To address this issue, VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.
- These instructions are packed and dispatched together, hence the name very long instruction word.

1.4.3 Very Long Instruction Word Processors (VLIW)

- VLIW describes a computer processing architecture in which a language compiler or pre-processor breaks program instructions down into basic operations that can be performed by the processor in parallel (at the same time).
- It refers to a processor architecture designed to take advantage of instruction-level parallelism. It is a less complex approach to achieving higher performance, with multiple operations performed simultaneously.
- In a VLIW processor, an instruction consists of multiple independent operations grouped together, i.e. multiple instructions can be stored in a single word.
- For example: ADD R1, R2; SUB R1, R5; LOAD R2, data; STORE R3, data;
- A typical word length is considered to be from 52 bits up to 2 Kbits.
- There are multiple independent functional units. Each operation in the instruction is assigned to a different functional unit. All functional units share the use of a common large register file.

1.4.3(A) VLIW Processor Structure

Q. Explain the basic working of a VLIW processor.

- To perform multiple operations in a single execution stage, we need separate units to perform each of these operations. For operations such as floating-point add, floating-point multiply, branching and integer ALU operations, a separate unit is needed for each; this is shown in Fig. 1.4.3.
- Fig. 1.4.3 shows the architecture of a typical VLIW processor. VLIW (Very Long Instruction Word) architectures are used to execute more than one basic instruction of a program at a time.
- The VLIW architecture stores multiple instructions in a single word; a VLIW instruction consists of multiple operations, for example FP Add (floating-point addition), FP Multiply (floating-point multiplication), a branch instruction, an integer ALU operation and so on. Hence, on issuing one instruction, multiple operations are executed simultaneously during the execution cycle of the pipeline.

Fig. 1.4.3: Typical VLIW architecture

- The VLIW processor relies on the compiler to find parallelism and to schedule dependency-free program code. In a VLIW-based system, a parallel compiler is used to generate the operations to be executed in parallel in the same word. The compiler is responsible for resolving dependencies among instructions at compile time.
- The special characteristic of a traditional VLIW processor is that the instruction has multiple operations, the operations in the instruction are independent, and there are no flow dependences between these operations.

1.4.3(B) Advantages, Disadvantages and Applications

1. Advantages
(a) No runtime dependence checks against previously or simultaneously issued operations.
(b) No runtime scheduling decisions.
(c) No need for register renaming.

2. Disadvantages
(a) No tolerance for any difference in the types of functional units.
(b) No object code compatibility.
(c) In superscalar and VLIW processors, more than a single instruction can be issued to the execution units per cycle.
(d) Superscalar machines are able to dynamically issue multiple instructions each clock cycle from a conventional linear instruction stream.
(e) VLIW processors use a long instruction word that contains a usually fixed number of instructions that are fetched, decoded, issued and executed synchronously. Hence, superscalar has dynamic issue, while VLIW has static issue.

3. Applications
VLIW is suitable for digital signal processing.

1.5 Trends in Microprocessors and Architectures

- Microprocessor clock speeds have posted impressive gains over the past decades. Higher levels of device integration have made available a large number of transistors, and the question of how best to utilize these resources is an important one. Current processors use these resources in multiple functional units and execute multiple instructions in the same cycle.
- The precise manner in which these instructions are selected and executed provides impressive diversity in architectures.
- The major trend in commercial microprocessor architecture is the use of complex architectures to exploit ILP (Instruction-Level Parallelism).
- Two approaches are used to exploit ILP: superscalar and Very Long Instruction Word (VLIW). Both attempt to issue multiple instructions to independent functional units at every clock cycle.
- Superscalar uses hardware to dynamically find data-independent instructions within an instruction window and issue them to independent functional units. VLIW, on the other hand, relies on the compiler to find ILP and to schedule the execution of independent instructions statically.
- Currently, most commercial microprocessors, such as the Intel Pentium, Compaq Alpha 21264, IBM PowerPC 620, Sun UltraSPARC, HP PA-8000 and MIPS R10000, use the superscalar design technique.
- The performance of these microprocessors has been improving at a phenomenal rate for decades. This performance growth has been driven by (1) innovation in compilers, (2) improvements in architecture and (3) tremendous improvements in VLSI technology.
- The latest superscalar microprocessors can execute four to six instructions concurrently using many non-trivial techniques, including dynamic branch prediction, out-of-order execution and speculative execution.
- However, speedup may not be achieved by these techniques alone, because of the limitations of the instruction window size and the ILP available in a typical program.
- Moreover, considerable design effort is required to develop such high-performance microprocessors. Therefore, developing a complex wide-issue superscalar microprocessor as the next-generation microprocessor may not be an efficient way to satisfy the required performance.

1.6 Limitations of Memory System Performance

Q. Explain the impact of memory latency and memory bandwidth on system performance.

- The effective performance of a program on a computer relies not just on the speed of the processor but also on the ability of the memory system to feed data to the processor.
- Memory system performance is largely captured by two parameters: latency and bandwidth. Latency is the time from the issue of a memory request to the time the data is available at the processor. Latency alone does not provide complete information about the performance of the memory system.
- Bandwidth is the rate at which data can be pumped to the processor by the memory system.
- Caches can be used to improve the effective memory latency. Caches are small, fast memory elements between the CPU and main memory, used to enhance the access speed of storage devices; their key properties are low latency and high bandwidth.
- If a piece of data is used repeatedly, the effective latency of the memory system is reduced by the cache: once data has been fetched by the processor it is saved in the cache, and the next time it can be supplied from the cache.
- Cache hit: the fraction of data references satisfied by the cache is called the cache hit ratio; the remaining fraction is the cache miss ratio.
- Bandwidth is the other parameter that affects memory performance, discussed next.

1.6.1 Impact of Memory Bandwidth

- Memory bandwidth is the rate at which data can be read from or stored into memory by a processor. It is determined by the bandwidth of the memory bus as well as of the memory units.
- Memory bandwidth can be improved by increasing the size of memory blocks.

1.6.2 Hiding Memory Latency Techniques

There are two approaches to hiding memory latency:
1. Pre-fetching
2. Multithreading

For example, suppose we are browsing the web on a very slow network connection. How do we deal with this problem? Two possible ways:
- We anticipate which pages we are going to browse ahead of time and issue requests for them in advance.
- We open multiple browsers and access different pages in each; while we are waiting for one page to load, we can be reading others.

1. Pre-fetching for latency hiding
Misses on loads cause programs to stall, so the loads are advanced in time: by the time the data is actually needed, it is already there. The only drawback is that more space may be needed to store the advanced loads.

2. Multithreading for latency hiding
In every clock cycle a computation can be performed, which addresses the latency problem. Using the multithreading approach, the processor is capable of switching threads at every cycle: a multithreaded processor can switch the context of execution in every cycle and is consequently able to hide latency effectively.

1.7 Dichotomy of Parallel Computing Platforms

- A parallel program specifies concurrency and interaction between concurrent subtasks. Concurrency is sometimes also referred to as the control structure, and interaction as the communication model.

1.7.1 Control Structure of Parallel Platforms

- Parallelism can be expressed at various levels of granularity, from the instruction level up to processes.
- In parallel computing, granularity is a qualitative measure of the ratio of computation to communication. The structure of the system is divided into multiple smaller parts called granules; granularity describes the division of a task into a number of smaller subtasks.
- Granularity in a parallel program is considered at different levels, for example the program level and the instruction level. There are two types:
1. Fine grained: the system is divided into a large number of small parts.
2. Coarse grained: the system is divided into a small number of large parts.
- Processing units in parallel computers either operate under the centralized control of a single control unit or work independently. Two models follow from this:
1. SIMD model (Single Instruction stream, Multiple Data stream)
2. MIMD model (Multiple Instruction stream, Multiple Data stream)

1.7.1(A) SIMD Architecture

Q. Explain with a suitable diagram the SIMD architecture.

- In this case a single control unit dispatches the same instruction to multiple processing elements, which work on different data.
- This kind of system is mainly used when many data items (arrays of data) have to be operated on with the same operation. Vector processors and array processors fall into this category. Fig. 1.7.1 shows the structure of a SIMD system.

Fig. 1.7.1: SIMD architecture

1.7.1(B) MIMD Architecture

Q. Explain with a suitable diagram the MIMD architecture.
Q. Write a short note on: (i) Data Flow Models (ii) Demand-Driven Computation (iii) Cache Memory

- This is a fully parallel processing arrangement. Here each processor has its own control unit; each processing element can execute different instructions on different sets of data.
- Examples of this kind of system are SMPs (Symmetric Multiprocessors), clusters and NUMA (Non-Uniform Memory Access) systems. Fig. 1.7.2 shows the structure of such a system.

Fig. 1.7.2: MIMD architecture

Table 1.7.1: Comparison between SIMD and MIMD

SIMD | MIMD
It is also called an array processor. | It is also called a multiprocessor.
A single stream of instructions is fetched. | Multiple streams of instructions are fetched.
The instruction stream is fetched by the control unit. | The instruction streams are fetched from shared memory.
The instruction is broadcast to multiple processing elements. | The instruction streams are decoded to obtain multiple decoded instruction streams.
SIMD computers require less hardware than MIMD. | Requires more hardware.

(a) Data flow model: Unlike the Von Neumann model, the data flow model represents the needs of a system diagrammatically. Through a description of the steps involved, from input through file storage to report production, data flow models are used to clearly represent the flow of data and the information interchange within an information system.

(b) Demand-driven computation: A demand-driven computer follows a top-down method, starting from the demand for the final value. Evaluating the outermost expression requires the evaluation of the next level of expressions, for example "(b+1)*c" and "(d/e)", which in turn require the evaluation of innermost operands such as "b+1". The results are then passed back to the nested demander in the opposite order. Demand-driven computing is synonymous with lazy evaluation, since operations are only carried out when their outcomes are needed by subsequent instructions.

1.7.2 Communication Model of Parallel Platforms

There are two primary forms of data exchange between parallel tasks:
1. The shared-address-space approach
2. The message-passing approach

1.7.2(A) Shared-Address-Space Platforms

Q. Describe uniform memory access and non-uniform memory access with diagrammatic representation.

- Platforms that provide a shared data space are called shared-data machines or multiprocessors.
- This platform provides a common data space accessible by all processors in the system: part or all of the memory is accessible to all processors.
- Processors interact by modifying data objects stored in this shared address space.

Classification of shared-address platforms: the shared-address platform is classified as
1. NUMA (Non-Uniform Memory Access)
2. UMA (Uniform Memory Access)

If the time taken by a processor to access any memory word in the system is identical, the platform is classified as Uniform Memory Access (UMA); otherwise it is a Non-Uniform Memory Access (NUMA) machine. The main difference between NUMA and UMA is the location of the memory.

Fig. 1.7.3

1. NUMA (Non-Uniform Memory Access)
- In the NUMA shared-memory architecture, each processor has its own local memory module that it can access directly. At the same time, it can also access any memory module belonging to another processor, using a shared bus or some other type of interconnect.
- These systems have a shared logical address space, but physical memory is distributed among the CPUs, so the access time to data depends on the data's position.
- A processor has a direct path to the block of local memory attached to it. All processors can see all memory, but access to the memory of other processors is slower.

Fig. 1.7.4: NUMA architecture

2. UMA (Uniform Memory Access)
- All processors share a unique centralized primary memory, so each CPU has the same memory access time. Each processor gets equal priority to access the main memory of the machine.
- These systems are also called Symmetric Shared-Memory Multiprocessors (SMP).

Fig. 1.7.5: UMA architecture

1.7.2(B) Message-Passing Platforms

- Platforms that support messaging are also called message-passing platforms or multicomputers. These platforms exchange messages in order to share data.
- This model allows multiple processes to read from and write data to a message queue without being connected to each other. Messages are stored in the queue until their recipient retrieves them.
- In Fig. 1.7.6, both processes P1 and P2 can access the message queue to store and retrieve data.
- These platforms comprise a set of processors, each with its own exclusive memory.
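As a concrete illustration of the message-passing style described in section 1.7.2(B), the sketch below (an illustrative example, not part of the original text) lets two processes with no shared address space exchange data through a channel. A POSIX pipe is used here purely as a stand-in for the message queue of Fig. 1.7.6; real multicomputers would use a message-passing library such as MPI, which Chapter 3 covers.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
    int fd[2];                       /* fd[0] = read end, fd[1] = write end */
    if (pipe(fd) == -1) return 1;

    pid_t pid = fork();
    if (pid == 0) {                  /* child: the "receiver" process P2 */
        close(fd[1]);
        char buf[64];
        ssize_t n = read(fd[0], buf, sizeof(buf) - 1);  /* blocks until a message arrives */
        if (n > 0) {
            buf[n] = '\0';
            printf("P2 received: %s\n", buf);
        }
        close(fd[0]);
        return 0;
    }

    /* parent: the "sender" process P1 */
    close(fd[0]);
    const char *msg = "partial result from P1";
    write(fd[1], msg, strlen(msg));  /* deposit the message in the channel */
    close(fd[1]);
    wait(NULL);
    return 0;
}
```

The two processes never touch a common variable; all interaction happens through explicit send and receive operations, which is exactly the contrast with the shared-address-space platforms of section 1.7.2(A).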
1.8 Physical Organization of Parallel Platforms and the Stored-Program Computer

- Here we discuss the physical architecture of parallel computers. Conventional architecture refers to the uniprocessor system. Some parallelism features can certainly be implemented on a single processor to improve speed, but there is a limit to this.
- Processor architecture began with the IAS Von Neumann computer. Fig. 1.8.1 shows the Von Neumann architecture. This system has three units: CPU, memory and I/O devices. The CPU itself has two units, the arithmetic unit and the control unit.
- The Von Neumann machine uses the stored-program concept: the program and data are stored in the same memory unit. Each location of the memory has a unique address, i.e. no two locations have the same memory address.
- Execution of instructions in a Von Neumann machine is carried out sequentially (unless explicitly altered by the program itself), from one instruction to the next.

Fig. 1.8.1: Von Neumann architecture of a computer

1.8.1 Evolution / Levels of Parallelism

- Features were gradually added to processors to provide parallelism and hence faster processing. Fig. 1.8.2 shows this architectural evolution tree.
- Notice in Fig. 1.8.2 that the first step towards parallelism was look-ahead, i.e. overlapping fetch and execute, together with the concept of parallelism in functions. Functional parallelism was implemented by two mechanisms, pipelining and multiple functional units; in the second mechanism, multiple functional units operate simultaneously.
- Vector instructions operate on large arrays of data to which a common operation is applied. For vector instructions, pipelined processors controlled by software looping were used initially; later, explicit vector processors were built. There are two variations of vector processing, memory-to-memory and register-to-register: the first loads and stores operands in memory, while the second keeps operands in registers.
- The register-to-register architecture further evolved into two types of processors, namely Single Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD).

Fig. 1.8.2: Architectural evolution tree

1.8.2 Ideal Model for Parallel Computing

- Without considering physical constraints and implementation details, an ideal model should give a suitable framework for developing algorithms. PRAM provides an ideal model of a parallel computer for analyzing the efficiency of parallel algorithms; it helps in writing parallel algorithms without any architectural constraints.

PRAM (Parallel Random Access Machine)
- The PRAM is used by parallel algorithm designers to model parallel algorithmic performance.
- It is a shared-memory multiprocessor with an unbounded number of processors, each able to access the shared memory in constant time, and each with unbounded local memory.
- It can also be suitable for modelling modern architectures, for example GPUs.
- The PRAM architecture model consists of a control unit, a global memory and an unbounded set of processors, each with its own private memory, as shown in Fig. 1.8.3.

Fig. 1.8.3: PRAM architecture
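PRAM algorithms are reasoned about in synchronous steps, with one processor assumed per data element. Before looking at how the model classifies memory conflicts (next), here is an illustrative simulation, not taken from the original text, of a classic PRAM-style sum: n values are reduced in about log2(n) steps, with each simulated processor combining one pair of elements per step. The array size and the sequential inner loop stand in for the n processors the model assumes; on a PRAM all iterations of the inner loop would execute in the same time step.

```c
#include <stdio.h>

#define N 16   /* number of elements; one simulated processor per element */

int main(void) {
    int x[N];
    for (int i = 0; i < N; i++) x[i] = i + 1;   /* 1 + 2 + ... + 16 = 136 */

    /* Recursive doubling: in the step with stride s, "processor" i adds
     * x[i+s] into x[i]. Each step halves the number of partial sums, so
     * only log2(N) synchronous steps are needed. Every location is read
     * and written by at most one processor per step, so all accesses are
     * exclusive (the EREW case defined below). */
    int steps = 0;
    for (int s = 1; s < N; s *= 2) {
        for (int i = 0; i + s < N; i += 2 * s)  /* all i in parallel on a PRAM */
            x[i] = x[i] + x[i + s];
        steps++;                                /* implicit barrier between PRAM steps */
    }

    printf("sum = %d, computed in %d parallel steps for N = %d\n", x[0], steps, N);
    return 0;
}
```

With n processors the sum therefore takes O(log n) time on the PRAM, compared with O(n) time sequentially, which is the kind of comparison this model is meant to make easy.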
- Each processor can access any memory unit directly or indirectly; an interconnection network is needed to route data between the processors and the memory. In each step, an active processor reads from global memory, performs a computation and writes back to global memory; the algorithm executes in a SIMD-like, synchronous fashion.
- The various PRAM models differ in how they handle read or write conflicts:
1. EREW (Exclusive Read Exclusive Write): p processors can simultaneously read from and write to p distinct memory locations.
2. CREW (Concurrent Read Exclusive Write): p processors can simultaneously read the contents of p' memory locations, where p' < p, and simultaneously write to p distinct memory locations.
3. CRCW (Concurrent Read Concurrent Write), with the write conflict resolved as:
(i) COMMON: all processors writing to the same global memory location must write the same value.
(ii) ARBITRARY: one of the competing processors' values is chosen arbitrarily.
(iii) PRIORITY: the processor with the lowest index writes its value.
- Any PRAM model or algorithm can emulate any other PRAM model or algorithm; for example, it is possible to convert a PRIORITY PRAM algorithm to an EREW PRAM algorithm.

1.9 Interconnection Networks for Parallel Computers

Fig. 1.9.1: Interconnection networks

- A parallel computing system consists of more than one processor, and these processing elements are connected to the memory. An interconnection network is needed whenever a processor has to access memory or another processor.

Classification of interconnection networks

- Interconnection networks are divided into static and dynamic classes.
- Static networks have passive connections: the connections among the processing elements or communication nodes are fixed and cannot be reconfigured to provide a different connection path.
- For example, if three processors are connected such that processor 'a' is connected to processor 'b' and 'b' is connected to processor 'c', then data to be transmitted from processor 'a' to processor 'c' cannot be sent directly; it must be routed through processor 'b'.
- In a dynamic network, the interconnection can be reconfigured to establish new paths as and when required. Considering the previous example, a path can be established dynamically between processors 'a' and 'c' for their communication.
- Table 1.9.1 shows the differences between static and dynamic networks.

Table 1.9.1: Difference between static and dynamic networks

Static networks | Dynamic networks
1. The connecting paths between the processing elements are static or passive. | The connecting paths between the processing elements are dynamic or active.
2. Links between two processors cannot be established dynamically, during the execution of a program. | Links between two processors can be established dynamically, during the execution of a program.
3. A static network is made up of fixed processor-to-processor (point-to-point) connections. | A dynamic network is made up of channels that can be switched, i.e. connected or disconnected.
4. These networks are used in distributed systems. | These networks are used in shared-memory systems with multiple processors.
5. Examples of static networks are the linear array, ring, tree, star, mesh, cube, hypercube, etc. | Examples of dynamic networks are buses, crossbar switches, single-stage dynamic networks, multistage dynamic networks, etc.
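The routing constraint described above for static networks (data from 'a' to 'c' must pass through 'b') can be made concrete with an adjacency matrix and a hop count. The snippet below is an illustrative sketch only; the three-node linear topology and the breadth-first search are assumptions chosen to mirror the a-b-c example, not something prescribed by the text.

```c
#include <stdio.h>

#define P 3            /* processors: 0 = a, 1 = b, 2 = c */

/* Static linear topology a - b - c: the links are fixed. */
static const int link[P][P] = {
    {0, 1, 0},
    {1, 0, 1},
    {0, 1, 0},
};

/* Minimum number of hops from src to dst, found by breadth-first search. */
static int hops(int src, int dst) {
    int dist[P] = {-1, -1, -1};
    int queue[P], head = 0, tail = 0;
    dist[src] = 0;
    queue[tail++] = src;
    while (head < tail) {
        int u = queue[head++];
        for (int v = 0; v < P; v++)
            if (link[u][v] && dist[v] < 0) {
                dist[v] = dist[u] + 1;
                queue[tail++] = v;
            }
    }
    return dist[dst];
}

int main(void) {
    printf("a -> b : %d hop(s)\n", hops(0, 1));   /* direct link      */
    printf("a -> c : %d hop(s)\n", hops(0, 2));   /* routed through b */
    return 0;
}
```

The same hop count, taken over the worst-case pair of nodes, is exactly the diameter parameter used in the next subsection to compare topologies.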
1.9.1 Static Networks

- We will now look at the different topologies used in interconnection networks; the different network topologies were listed in Fig. 1.9.1.
- Before going through the individual topologies, consider the parameters used to measure the performance of static interconnection topologies.
- The number of links is an important parameter for the cost of the network: the more links, the higher the cost. However, if performance increases because of a slight increase in the number of links, it is affordable.
- The degree is another term that matters for the cost of building a topology. It is the maximum number of adjacent processors a processor is directly connected to; the worst case is considered, i.e. the processor that has the maximum number of processors connected directly to it.
- The diameter (also referred to in this text as the permutation cycle) is an important parameter for measuring the performance of a network. It is the maximum number of processors a message has to pass through to reach the farthest processor; again the worst case over all processor pairs is considered.

1.9.1(A) Linear and Ring Topologies

- In a linear array, all the processing elements are connected in series, as shown in Fig. 1.9.2.
- The number of links required in this case is n - 1, where n is the number of processing elements. The degree of the edge PEs is 1, while that of the middle PEs is 2, i.e. one PE is connected to each edge PE, while two PEs are connected to each middle PE.
- The diameter of this network topology is n - 1, since the first and the last processing elements must communicate through all the n - 1 intermediate processing elements.
- This topology is not efficient when the number of processing elements is very large.
- A slightly better version of this is the ring topology, shown in Fig. 1.9.3.
- In this case the number of links required is n, where n is the number of processing elements, and two PEs are connected to each PE, so the degree of all the processing elements is 2.
- The diameter in this case is n/2: since the processors are connected in a ring, one processor can reach the processor farthest from it through n/2 processing elements.
- This topology is slightly better than the linear topology, but again it is not efficient enough for a huge number of PEs.
- In the ring topology, the connection from the last to the first processing element allows simpler and faster communication paths for the processors that sit at the edges of the linear topology.

Fig. 1.9.3: Ring topology

- A better version is the star connection, in which all processing elements are connected to a common processing element at the centre. This is shown in Fig. 1.9.4.
- The diameter of a star topology is just 2: from any processor to any other processor, the message passes through at most the central processor. However, a huge number of links converge on the one central processing element.
- The degree of this topology is n - 1, since the central element is connected to the n - 1 other processing elements, and the number of links is also n - 1.

Fig. 1.9.4: Star connection
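The link, degree and diameter figures quoted above for the linear array, ring and star can be generated for any n; the short program below simply evaluates those closed-form expressions for a few sizes. It is an illustrative aid, with the formulas taken directly from the descriptions above.

```c
#include <stdio.h>

/* Closed-form parameters for three static topologies with n processing
 * elements, as described in section 1.9.1(A):
 *   linear array: n-1 links, diameter n-1
 *   ring        : n   links, diameter n/2
 *   star        : n-1 links, diameter 2
 */
int main(void) {
    int sizes[] = {4, 8, 16, 64};
    printf("%6s | %14s | %8s | %10s\n", "n", "topology", "links", "diameter");
    for (int k = 0; k < 4; k++) {
        int n = sizes[k];
        printf("%6d | %14s | %8d | %10d\n", n, "linear array", n - 1, n - 1);
        printf("%6d | %14s | %8d | %10d\n", n, "ring",         n,     n / 2);
        printf("%6d | %14s | %8d | %10d\n", n, "star",         n - 1, 2);
    }
    return 0;
}
```

For large n the diameters of the linear array and the ring grow linearly, which is exactly why the mesh, hypercube and tree topologies of the following subsections are preferred at scale.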
1.9.1(B) Meshes and Torus

- A mesh is a special arrangement of processing elements in a grid. This kind of connection is used, for example, for matrix multiplication, which is studied in a later section.
- The degree of this topology is 4, i.e. each processing element is connected to four neighbouring processing elements.
- The number of links is 2(n² - n). This can be seen in Fig. 1.9.5, which shows a 6 × 6 mesh of processing elements.
- The diameter is 2(n - 1), where the size of the mesh is n × n, i.e. there are n² processing elements. Connecting from the processing element on one corner to the one on the opposite corner requires passing through 2(n - 1) processing elements; for example, in the 6 × 6 mesh shown in Fig. 1.9.5 it takes 10 hops to reach from one corner to the other.
- A variant of the mesh topology is the torus. In this case, instead of linear arrays connected in mesh form, we have rings connected in mesh form; this is shown in Fig. 1.9.6.

Fig. 1.9.5: 2-D mesh

- In the torus the degree is also 4, i.e. four elements are connected to every element. The number of links is slightly higher, since the 2n wrap-around connections are added.
- The diameter is n (for the n × n torus). For example, to communicate from a processor at one end to the processor at the other end of the 6 × 6 torus in Fig. 1.9.6, we need to go through only 6 processing elements. Hence the torus gives a performance increase over the mesh because of a few extra connection paths.

1.9.1(C) Hypercubes

Fig. 1.9.7: Hypercubes of different dimensions, i.e. 0, 1, 2, 3 and 4

- There are different cases of cube networks. Five different dimensions of cubes are shown in Fig. 1.9.7: the dimensions are 0, 1, 2, 3 and 4 respectively, and the numbers of processing elements are 2 raised to the dimension, i.e. 1, 2, 4, 8 and 16 respectively.
- This network topology also has another name that varies with the number of dimensions: the 3-D hypercube is also called a 3-cube network, a 4-D hypercube a 4-cube network, and so on.
- The diameter of a hypercube is k, where k is the number of dimensions. For example, in a 3-cube network, if communication is to take place between the top-left-corner processor and the bottom-right processor, the message has to pass through 3 different links, which is the diameter. This is a major advantage of the hypercube-connected network.
- This topology was used in some early machines such as Intel's iPSC and the nCUBE, and many algorithms map onto it with good performance.
- The processing elements are given special addresses so that the addresses of adjacent processing elements differ in only one bit. For this, Gray-code addressing is used, as shown in Fig. 1.9.8.
- Various routing methods are possible for the 3-cube network. They are based on the bits of the node IDs shown in Fig. 1.9.8; these routing methods are shown in Fig. 1.9.9.
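Because adjacent hypercube nodes differ in exactly one address bit, routing in a k-cube can be expressed with bitwise operations: XOR-ing the source and destination labels marks the dimensions that still have to be crossed, and the number of set bits gives the hop count. The following is an illustrative sketch of this standard scheme (commonly called dimension-ordered or e-cube routing); it assumes the k-bit binary node labels in which neighbours differ by one bit, as described above.

```c
#include <stdio.h>

#define K 3                    /* dimension of the hypercube (a 3-cube has 8 nodes) */

/* Count set bits: the Hamming distance src^dst is the number of hops needed. */
static int popcount(unsigned v) {
    int c = 0;
    for (; v; v >>= 1) c += v & 1u;
    return c;
}

/* Dimension-ordered routing: correct the differing address bits one at a
 * time, lowest dimension first. Each correction is one hop to a neighbour. */
static void route(unsigned src, unsigned dst) {
    unsigned diff = src ^ dst;             /* bits still to be corrected */
    unsigned cur  = src;
    printf("route %u -> %u (%d hop(s)):", src, dst, popcount(diff));
    for (int d = 0; d < K; d++) {
        if (diff & (1u << d)) {
            cur ^= 1u << d;                /* cross dimension d */
            printf(" %u", cur);
        }
    }
    printf("\n");
}

int main(void) {
    route(0u, 7u);   /* opposite corners of the 3-cube: 3 hops, the diameter */
    route(2u, 3u);   /* adjacent nodes: a single hop */
    return 0;
}
```

The worst case is a pair of nodes whose labels differ in every bit, which needs k hops and is why the diameter of a k-cube is k.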
1.9.1(D) Trees

- Fig. 1.9.10 shows another way of connecting the processing elements, in the form of a tree. There are various topologies under this heading, but the one shown in Fig. 1.9.10(b) is the most widely used.
- In this case there is a processing element at the top that is connected to two other processing elements; each of those is in turn connected to two further processing elements, and so on.
- The degree in this case is 3: each element is connected to one element in the level above and to two elements in the level below. The number of links is n - 1, where n is the number of processing elements.
- The most attractive feature is that the diameter increases only as a logarithmic value. A tree with k levels has 2^k - 1 processing elements; the tree shown in Fig. 1.9.10 has 5 levels (one processor connected to two, each connected to two more, and so on for 5 levels), yet the longest route only climbs towards the root and back down, so the diameter grows in proportion to log2 of the number of processing elements.

Fig. 1.9.9: Routing algorithms in a 3-cube network: (a) routing by the least significant bit C0; (b) routing by the middle bit C1; (c) routing by bit C2

Fig. 1.9.10: Tree-connected network topologies

1.9.1(E) Fully-Connected Network

- A network where each node interconnects with all other nodes of the network is called a fully connected network: every node has a direct link to every other node.
- The drawback is that it requires too many connections, as shown in Fig. 1.9.11.

Fig. 1.9.11: Fully-connected network

1.9.1(F) Fat Tree

- The fat-tree network is a universal network for efficient communication. It is a modified form of the original tree network.
- In an ordinary tree network every branch has the same thickness, regardless of its place in the hierarchy. In a fat tree, branches near the top of the hierarchy are thicker than branches further down the hierarchy, i.e. the bandwidth of the edges increases towards the root.
- This kind of network provides more flexibility, because in practice more traffic flows towards the root than towards the leaves.
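The "too many connections" drawback of the fully-connected network can be quantified: with n nodes it needs one link per pair of nodes, i.e. n(n-1)/2 links (the standard complete-graph count, which the text does not state explicitly), whereas the tree needs only n - 1. The short program below tabulates both so the difference in growth is visible; it is purely an illustrative aid.

```c
#include <stdio.h>

int main(void) {
    int sizes[] = {4, 8, 16, 64, 256};
    printf("%6s | %18s | %10s\n", "n", "fully connected", "tree");
    for (int k = 0; k < 5; k++) {
        long n = sizes[k];
        long full = n * (n - 1) / 2;   /* one direct link per pair of nodes  */
        long tree = n - 1;             /* one link per node except the root  */
        printf("%6ld | %18ld | %10ld\n", n, full, tree);
    }
    return 0;
}
```

The fully connected network keeps the diameter at 1, but its link count grows quadratically, which is why scalable machines prefer the tree, fat-tree, mesh and hypercube topologies described above.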
Crossbar Interconnect Network Crossbar networks provide connections among all its inputs and outputs simultaneously. There are switching elements at the intersection of the horizontal and vertical lines inside the switch, itis said to be a non-blocking network, since it allows multiple input and output connections achievable simultaneously. Fig. 19.15 shows the connections system of a crossbar interconnect network. Mi M2 M3 M4 M5 Mo M7 MB Straight switch setting Diagonal switch setting Fig. 1.9.15 : Crossbar network This network can be used to have one-to-one as well as one-to-many message passing i.e. we can have message passing between one processing element to another or one processing element to multiple elements. Ina crossbar network, every incident packet is prefixed with a tag of the destination of the packet. The packet received by the selector of the input port checks this tag and also checks the status of the corresponding output port. If the output port is free, then the connection is established and the data is transferred. But if the required port is busy then the connection is refused. Hence all the selectors have to work simultaneously. Tech Pubiications High Performance Computing ¢ Introduction to Parallel wa 2, Single Stage Interconnect Network ‘+ The single stage interconnect network has a single switching element between the inputs and outputs or, network. ‘The different possible settings of a2 x 2 switching elements are shown in the Fig. tS Be Se straight ange Upper Lower " Erohand? proadoast broadonst Fig. 1.9.16: Different settings of 2» 2 switching elements in single stage +The above settings operations are clear from the name itself: The straight setting passes ane is 10 ong connection Le. upper one to the upper one and the lower ane to the lower one. The exchange setting swan, the upper and the lower one. The upper broadcast makes @ connection from the WAPer to aon the outa wile lower broadcast provides a connection from lower input to ll the outputs 6. ; x 8 interes ‘A shuffle and exchange single stage network of & inputs and 8 outputs Le. 8 interconnect networks Perfect Switching shufflo elements me Fig. 1.9.17: 8 x 8 shuffle and exchange single stage interconnect network This is again a kind of implementation for the finding FFT. The single stage shuffle exchange network comprises of a perfect shuffle at the input, followed by the singe stage of switching elements and finally by the output that is buffered and feedback to its corresponds inputs. Thus the packets of data can be circulated in the network until they reach and can exit the desired output. 3. Multistage Interconnect Network If this single stage shuffle exchange units are cascaded then it results in multistage interconnection network In this case the data need not be circulating in the network, instead it can directly reach to the required ov! port. There are two types of mulistage interconnect network In case ifthe input and output stage are same the" is called as single sided while ifthe inputs and outputs are different then itis called as two sided. Fig. 1.9.18 shows the single sided multistage interconnection network while the Fig. 1.9.18 shows two 5 multistage interconnect network. - Ht = —- jigh Performance Computing Introduction to Parallel Computing nigh Fig. 
• Fig. 1.9.18 shows the single sided multistage interconnection network, while Fig. 1.9.19 shows the two sided multistage interconnection network.

Fig. 1.9.18 : Single sided multistage interconnect network

• In the single sided multistage interconnection network, the switching elements and the links are both bidirectional. Some multiprocessing systems use the single sided multistage interconnect network, but most of them use the two sided multistage interconnection network.

Fig. 1.9.19 : Two sided multistage interconnection network

• Also, the multistage interconnection networks can be classified as single path and multipath networks. In case of a single path multistage interconnect network, there exists only one unique path between an input and output port pair.
• In case of multipath there can be multiple paths between a pair of input and output ports. Hence it can be said to be non-blocking communication, i.e. if one path is not available because of another communication happening between two different ports, another path can be used without blocking the communication.

Fig. 1.9.20 : Mapping of a 3 cube network to a multistage interconnect network

• In Fig. 1.9.20, a proper mapping is shown between the interconnections of a 3 cube (hypercube) network and the corresponding connections in a multistage interconnection network. The three different colours (bold and light lines) are used to understand how this multistage network can be made using the 3 cube network. In order to obtain a single stage interconnection using the 3 cube network, the same connections can be used only up to the first stage.
• The above process can also be reversed, i.e. even a multistage interconnect network can be used as a 3 cube network by doing the reverse mapping of the connections. This can also be seen clearly from the above diagram.

1.10 Cache-Coherence in Multi-Processor Systems

• There are two main challenges associated with parallel processing. One is how much of the program is going to be parallel and how much of the program is going to be sequential.
• The second issue is communication latency. Even very little communication latency is going to have a major effect on multiprocessor performance.
• So the solution is: can we look at caching the data for the multiple processors? Instead of having the data in one place, with all of them trying to get the data from that one place (which will lead to communication latency), we can have the data in different caches.
• Caches serve to increase bandwidth relative to memory or the bus, reduce the latency of access, and are valuable for both private data and shared data. But the moment we bring caches into a multiprocessor system, we also have the problem of cache coherence and cache consistency.

Fig. 1.10.1

Coherence
• All reads by any processor must return the most recently written value.
• Writes to the same location by any two processors are seen in the same order by all processors (serialization).
• In order for cache subsystems to work properly, the CPU and the other bus masters must be getting the most updated copy of the requested information.
• There are several cases wherein the data stored in cache or main memory may be altered whereas the duplicate remains unchanged.
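The last point, a duplicate copy that silently goes stale, can be seen in a toy example. The following C sketch is purely illustrative (the struct and variable names are invented here): two private caches each hold a copy of the same memory word, and after P0 updates only its own copy, both P1 and main memory still return the old value. This is exactly the situation that a cache coherence protocol has to prevent.

```c
#include <stdio.h>

/* Toy model of the stale-duplicate problem: with no coherence protocol,
 * a write into one private cache is invisible to the other cache and
 * to main memory. */
struct cache_line {
    int valid;
    int value;
};

int main(void)
{
    int memory = 10;                        /* shared location X in main memory */
    struct cache_line c0 = { 1, memory };   /* P0 has cached X */
    struct cache_line c1 = { 1, memory };   /* P1 has cached X */

    c0.value = 99;                          /* P0 writes X in its own cache only */

    printf("P0 reads X = %d\n", c0.value);           /* 99       */
    printf("P1 reads X = %d  (stale)\n", c1.value);  /* still 10 */
    printf("memory  X = %d  (stale)\n", memory);     /* still 10 */
    return 0;
}
```

The migration and replication of data described next are what create these duplicate copies in the first place.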
• The basic problem of coherence comes up because caches support migration and replication, which allow the movement of data from main memory to one cache, and from main memory to other caches as well. This movement is called migration of data.
• Migration : movement of data from main memory to cache.
• Replication : multiple copies of data.
• We are allowing replication of data, which means multiple copies of the same data can be available in different caches.
• As we support migration and replication of data, we will definitely have the cache coherence problem. To handle the cache coherence problem, we have two hardware solutions (cache coherence protocols):

1. Snooping based protocol (Snoopy bus)
Each core tracks the sharing status of each block. Each cache controller has information about its own cache, and a general snooping happens on the bus: every controller looks at what is transmitted on the bus.

2. Directory based protocol
The sharing status of all the cache blocks is kept in one place (a directory location).

1.10.1 Snooping Solution (Snoopy Bus)

• In case of the snooping solution, we make use of the bus, which is the most common interconnection topology.
• We have a snoopy bus and we send all requests for data to all processors. Processors snoop to see if they have a copy and respond accordingly.
• This requires broadcast, since the caching information is kept at the processors.
• It works well with a bus, which is a natural broadcast medium.
• It is useful for small scale machines.

Basic approach for snooping

1. Write invalidate protocol
When a processor writes to a block, it invalidates the other shared copies of that block (multiple readers, single writer).

2. Write update protocol
When a processor writes, it updates the other shared copies of that block, i.e. when one processor writes, the value is broadcast and any copies that may be in other caches are updated.
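A minimal sketch of the write invalidate idea is given below in C. It is an assumption-heavy illustration rather than a real MSI/MESI implementation: the caches are plain structs, the "bus" is just a loop over the other caches, and write-through to memory is used to keep the example short. What it shows is the multiple readers / single writer rule: a write by one processor invalidates every other cached copy, so the next read by those processors misses and fetches the new value.

```c
#include <stdio.h>

#define NCPU 3

struct line { int valid; int value; };

static struct line cache[NCPU];          /* one private cache line per CPU  */
static int memory = 10;                  /* the shared block in main memory */

/* Read: hit if the copy is valid, otherwise fetch from memory (read miss). */
static int cpu_read(int id)
{
    if (!cache[id].valid) {
        cache[id].value = memory;
        cache[id].valid = 1;
    }
    return cache[id].value;
}

/* Write: broadcast an invalidate on the snooped bus, then update the
 * writer's copy (write-through to memory, for simplicity). */
static void cpu_write(int id, int v)
{
    for (int i = 0; i < NCPU; i++)
        if (i != id)
            cache[i].valid = 0;          /* every other copy is invalidated */
    cache[id].value = v;
    cache[id].valid = 1;
    memory = v;
}

int main(void)
{
    int a = cpu_read(0), b = cpu_read(1);
    printf("P0 reads %d, P1 reads %d\n", a, b);   /* both now cache the block */

    cpu_write(2, 42);                             /* P2 writes: P0 and P1 invalidated */

    printf("P0 re-reads %d, P1 re-reads %d\n", cpu_read(0), cpu_read(1));
    return 0;
}
```

A write update protocol would replace the invalidation loop with one that writes the new value into every other valid copy instead.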
