DECODE HIGH PERFORMANCE COMPUTING
(For END SEM Exam - 70 Marks)
B.E. (Computer Engineering) Semester - VIII

© Copyright with Technical Publications. All publishing rights (printed and e-book version) reserved with Technical Publications. No part of this book should be reproduced in any form, electronic, mechanical, photocopy or any information storage and retrieval system, without prior permission in writing from Technical Publications, Pune.

SYLLABUS
High Performance Computing - [410250]
Examination Scheme : End Sem (TH) - 70 Marks

Unit III : Parallel Communication
Basic Communication : One-to-All Broadcast, All-to-One Reduction, All-to-All Broadcast and Reduction, All-Reduce and Prefix-Sum Operations, Collective Communication using MPI : Scatter, Gather, Broadcast, Blocking and Non-blocking MPI, All-to-All Personalized Communication, Circular Shift, Improving the speed of some communication operations. (Chapter - 3)

Unit IV : Analytical Modeling of Parallel Programs
Sources of Overhead in Parallel Programs, Performance Measures and Analysis : Amdahl's and Gustafson's Laws, Speedup Factor and Efficiency, Cost and Utilization, Execution Rate and Redundancy, The Effect of Granularity on Performance, Scalability of Parallel Systems, Minimum Execution Time and Minimum Cost, Optimal Execution Time, Asymptotic Analysis of Parallel Programs. Matrix Computation : Matrix-Vector Multiplication, Matrix-Matrix Multiplication. (Chapter - 4)

Unit V : CUDA Architecture
Introduction to GPU : Introduction to GPU Architecture, Introduction to CUDA C, Write and launch CUDA C kernels, Handling Errors, CUDA memory model, Communication and synchronization, Parallel programming in CUDA C. (Chapter - 5)

Unit VI : High Performance Computing Applications
Scope of Parallel Computing, Parallel Search Algorithms : Depth First Search (DFS), Breadth First Search (BFS), Parallel Sorting, Distributed Computing : Document classification, Frameworks - Kubernetes, GPU Applications, Parallel Computing for AI/ML. (Chapter - 6)

TABLE OF CONTENTS

Chapter - 3 : Parallel Communication (3-1) to (3-20)
  3.1 One-to-All Broadcast and All-to-One Reduction
  3.2 All-to-All Broadcast and Reduction
  3.3 All-Reduce and Prefix Sum Operation
  3.4 All-to-All Personalized Communication
  3.5 Circular Shift

Chapter - 4 : Analytical Modeling of Parallel Programs (4-1) to (4-31)
  4.1 Understanding the Parallel System
  4.2 Sources of Overhead in Parallel Programs
  4.3 Performance Metrics for Parallel Systems
  4.4 Scalability of Parallel Systems
  4.5 Minimum Execution Time and Minimum Cost-Optimal Execution Time

Chapter - 5 : CUDA Architecture
  5.0 The Backdrop of CUDA Revolution
  5.1 What is the CUDA Architecture ?
  5.2 Using CUDA Architecture
  5.3 Applications of CUDA
  5.4 Introduction to CUDA C
  5.5 Write and Launch CUDA C Kernels
  5.6 Managing GPU Memory
  5.7 Manage Communication and Synchronization
  5.8 Parallel Programming in CUDA C

Chapter - 6 : High Performance Computing Applications (6-1) to (6-40)
  6.1 Importance of Sorting in Computers
  6.2 Issues in Sorting on Parallel Computers
  6.3 Parallelizing Quick Sort
  6.4 All-Pairs Shortest Paths
  6.5 Algorithm for Sparse Graph
  6.6 Parallel Depth-First Search
  6.7 Parallel Best-First Search
  Solved SPPU Question Papers
3.1 : One-to-All Broadcast and All-to-One Reduction

Q.1 Explain one-to-all broadcast and all-to-one reduction with a neat diagram.
Ans. : • One-to-all broadcast is the operation in which a single process sends identical data to all other processes.
• Parallel algorithms often need this operation. Let us consider that data of size m is to be sent to all the processes.
• Initially only the source process has the data.
• After termination of the algorithm there is a copy of the initial data with each process.
• p copies of the initial data are generated, where p is the number of processors. Fig. Q.1.1 shows one-to-all broadcast.
• Many important parallel algorithms like matrix-vector multiplication, Gaussian elimination, shortest paths and vector inner product need the one-to-all broadcast operation.
Fig. Q.1.1 : One-to-all broadcast and all-to-one reduction

Q.2 Explain the all-to-one reduction operation.
Ans. : • All-to-one reduction is the dual of one-to-all broadcast. Each participating process starts with a buffer M containing m words.
• In reduction, the data from all processes are combined through an associative operator (such as sum, product, maximum or minimum) and accumulated at a single destination process into one buffer of size m.
• In other words, after termination of the algorithm the i-th word of the accumulated buffer is the sum, product, maximum or minimum of the i-th words of each of the original buffers. Fig. Q.2.1 shows all-to-one reduction together with one-to-all broadcast.
Fig. Q.2.1 : One-to-all broadcast and all-to-one reduction

Q.3 Explain how matrix-vector multiplication can be performed using one-to-all broadcast and all-to-one reduction.  [SPPU : Oct.-18, Dec.-18]
Ans. : • Consider an n x n mesh of nodes. The problem is to multiply an n x n matrix A with an n x 1 vector x to get an n x 1 result vector y.
• As shown in Fig. Q.3.1, each element of the matrix belongs to a different process.
• The vector is distributed among the processes in the topmost row of the mesh.
• The result vector y is generated on the leftmost column of processes.
• All the rows of the matrix must be multiplied with the vector.
• Every process takes the element of the vector present with the topmost process in its column. For example, all the processes in the first column take the vector element from process P0.
• For this, a one-to-all broadcast of the vector element is done in each column; each column of the n x n mesh is treated as an n-node linear array.
• After the broadcast, each process multiplies the element present with it by the broadcasted vector element.
• The results of the multiplications are then added row-wise by performing an all-to-one reduction. The first process of each row is the destination of the reduction operation; for example, for the first row, process P0 is the destination process.
Fig. Q.3.1 : Matrix-vector multiplication on an n x n mesh using one-to-all broadcast and all-to-one reduction
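The broadcast-then-reduce pattern of Q.3 maps directly onto MPI collectives. The following is a minimal sketch (not from the book) that multiplies an n x n matrix by a vector using MPI_Bcast for the one-to-all broadcast of x and MPI_Reduce for the all-to-one reduction of partial results; the column-block distribution, the value of N and the variable names are illustrative assumptions.

/* Matrix-vector multiply sketch using one-to-all broadcast (MPI_Bcast)
 * and all-to-one reduction (MPI_Reduce). Column-block distribution;
 * assumes N is divisible by the number of processes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 8                               /* matrix dimension (assumed) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int cols = N / size;                  /* columns owned by this process */
    double x[N];                          /* full input vector             */
    double *Ablock = (double*)malloc(N * cols * sizeof(double));
    double y_part[N], y[N];

    /* Root initialises x; every process fills its own column block of A. */
    if (rank == 0)
        for (int i = 0; i < N; i++) x[i] = 1.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < cols; j++)
            Ablock[i * cols + j] = (double)(i + rank * cols + j);

    /* One-to-all broadcast of the vector x from process 0. */
    MPI_Bcast(x, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Local partial product: my columns of A times my slice of x. */
    for (int i = 0; i < N; i++) {
        y_part[i] = 0.0;
        for (int j = 0; j < cols; j++)
            y_part[i] += Ablock[i * cols + j] * x[rank * cols + j];
    }

    /* All-to-one reduction: sum the partial result vectors at process 0. */
    MPI_Reduce(y_part, y, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("y[0] = %f\n", y[0]);
    free(Ablock);
    MPI_Finalize();
    return 0;
}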
Q.4 Explain the algorithm for one-to-all broadcast.  [SPPU : Oct.-18]
Ans. : Consider the algorithm for one-to-all broadcast on a 2^d-node hypercube network.
• The following points should be noted before reading the algorithm :
  + The source of the broadcast is node 0.
  + The procedure is executed at all the nodes.
  + my_id is the label of a node.
  + The variable mask is used to decide which nodes will communicate in the current iteration.
  + mask consists of d = log p bits. For example, for a 3D hypercube of 8 nodes, d = log 8 = 3.
  + Initially all the bits of mask are set to 1, so mask = 111.
  + The loop counter i indicates the current dimension; communication starts at the highest dimension.
  + In each iteration only the nodes whose lower i bits are 0 communicate. For example, for a 3D hypercube, in the second iteration (i = 1) the nodes 0 (000), 2 (010), 4 (100) and 6 (110) communicate.
  + After log p communication steps the algorithm terminates.

1.  procedure ONE_TO_ALL_BC (d, my_id, X)
2.  begin
3.      mask := 2^d - 1;                       /* Set all d bits of mask to 1 */
4.      for i := d - 1 downto 0 do             /* Outer loop */
5.          mask := mask XOR 2^i;              /* Set bit i of mask to 0 */
6.          if (my_id AND mask) = 0 then       /* Lower i bits of my_id are 0 */
7.              if (my_id AND 2^i) = 0 then
8.                  msg_destination := my_id XOR 2^i;
9.                  send X to msg_destination;
10.             else
11.                 msg_source := my_id XOR 2^i;
12.                 receive X from msg_source;
13.             endelse;
14.         endif;
15.     endfor;
16. end ONE_TO_ALL_BC
Algorithm Q.4.1 : One-to-all broadcast of a message X from node 0 of a d-dimensional p-node hypercube (d = log p). AND and XOR are bitwise logical AND and exclusive-OR operations, respectively.

Explanation of the algorithm :
• Line 3 : As explained previously, the mask variable is set to 111; with d = 3, 2^3 - 1 = 1000 - 001 = 111.
• Line 4 : For the first iteration, i = 2.
• Line 5 : Considering d = 3, mask = 111 XOR 100, so the new value of mask = 011.
• Line 6 : Consider my_id = 000; 000 AND 011 = 0, so the nodes having 0 in their lower i bits are chosen.
• Line 7 : Among the chosen nodes, this test decides who sends and who receives in this step. For example, in the first iteration, 000 AND 100 = 000, so node 0 is a sender.
• Line 8 : msg_destination = 000 XOR 100 = 100. Thus node 0 (000) becomes the sender and node 4 (100) becomes the receiver.
• In the second iteration, for i = 1, nodes 0 (000) and 4 (100) are senders while nodes 2 (010) and 6 (110) are receivers.
• It is important to note that this procedure must be executed at each node, and it works only if node 0 is the source of the broadcast.

3.2 : All-to-All Broadcast and Reduction

Q.5 Explain all-to-all broadcast on a ring with the help of an algorithm.
Ans. : • In the all-to-all broadcast operation on a linear array or a ring, each node first sends its data to one of its neighbours; this process is continued in subsequent steps so that all the communication links can be kept busy.
• The initial source of each message is identified by the label in parentheses along with the time step.
• The number(s) in parentheses next to each node are the labels of nodes from which data has been received before the current communication step.
• At the end of the operation each node has collected all (p - 1) pieces of information that originated at the other nodes.
• The following points should be noted before reading the algorithm :
  + my_msg contains the initial message to be broadcast.
  + At the end of the algorithm all p messages are collected at each node in result.
  + The same procedure works for the ring as well as the linear array network.

1.  procedure ALL_TO_ALL_BC_RING (my_id, my_msg, p, result)
2.  begin
3.      left  := (my_id - 1) mod p;
4.      right := (my_id + 1) mod p;
5.      result := my_msg;
6.      msg := result;
7.      for i := 1 to p - 1 do
8.          send msg to right;
9.          receive msg from left;
10.         result := result U msg;
11.     endfor;
12. end ALL_TO_ALL_BC_RING
Algorithm Q.5.1 : All-to-all broadcast on a p-node ring
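The ring procedure above can be realised almost line for line with MPI point-to-point calls. The sketch below (not from the book; the buffer size and names are assumptions) uses MPI_Sendrecv so that the send to the right neighbour and the receive from the left neighbour of each step happen without deadlock.

/* All-to-all broadcast on a logical ring, mirroring ALL_TO_ALL_BC_RING.
 * Each process contributes one integer; 'result' ends up holding all p
 * contributions on every process. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int left  = (rank - 1 + p) % p;
    int right = (rank + 1) % p;

    int result[64];                 /* assumes p <= 64 for the sketch     */
    int msg = rank * 10;            /* my_msg : this process's own data   */
    result[rank] = msg;

    for (int i = 1; i <= p - 1; i++) {
        int incoming;
        /* send the current msg to the right, receive a new one from the left */
        MPI_Sendrecv(&msg, 1, MPI_INT, right, 0,
                     &incoming, 1, MPI_INT, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        int src = (rank - i + p) % p;   /* original owner of 'incoming'   */
        result[src] = incoming;         /* result := result U msg         */
        msg = incoming;                 /* forward it in the next step    */
    }

    printf("rank %d collected %d items\n", rank, p);
    MPI_Finalize();
    return 0;
}

In practice the single collective MPI_Allgather performs exactly this all-to-all broadcast in one call.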
3.3 : All-Reduce and Prefix Sum Operation

Q.6 Explain the prefix sum or scan operation.  [SPPU : Oct.-18, Marks 4; Oct.-19, Marks 6]
Ans. : • Calculating prefix sums (the scan operation) uses the same communication pattern as all-to-all broadcast on a hypercube.
• At the start, the number n_k is present with the node labelled k. After termination of the algorithm, the same node holds the sum n_0 + n_1 + ... + n_k.
• Instead of a single number, each node may have a buffer or a vector of size m, and the result is the element-wise sum of the buffers.
• Each node maintains an additional buffer, denoted by square brackets, to collect the correct prefix sum.
• After every communication step, the message coming from a node with a smaller label than that of the recipient node is added to the result buffer.
• As shown in Fig. Q.6.1, the contents of the outgoing message are denoted by parentheses. These contents are updated with every incoming message.
• For example, after the first communication step, nodes 0, 2 and 4 do not add the data received from nodes 1, 3 and 5 to their result buffers; however, the contents of their outgoing messages for the next step are updated.
• Not all of the messages received by a node contribute to its final result; some of the messages it receives may be redundant.

1.  procedure PREFIX_SUMS_HCUBE (my_id, my_number, d, result)
2.  begin
3.      result := my_number;
4.      msg := result;
5.      for i := 0 to d - 1 do
6.          partner := my_id XOR 2^i;
7.          send msg to partner;
8.          receive number from partner;
9.          msg := msg + number;
10.         if (partner < my_id) then result := result + number;
11.     endfor;
12. end PREFIX_SUMS_HCUBE
Fig. Q.6.1 : Prefix sums computation for an eight-node hypercube

3.4 : All-to-All Personalized Communication

Q.7 Explain the all-to-all personalized operation with the help of matrix transpose.
Ans. : • Let A be an n x n matrix. A^T, the transpose of matrix A, has the same size as A, and A^T[i, j] = A[j, i] for 0 <= i, j < n.
• Considering a one-dimensional row-major partitioning, an n x n matrix can be mapped onto n processors such that each processor contains one full row of the matrix.
• Each processor sends a distinct element of the matrix to every other processor, so this is an example of all-to-all personalized communication.
• For p processes where p <= n, each process has n/p rows, i.e. n²/p elements.
• For finding the transpose, all-to-all personalized communication of matrix blocks of size n/p x n/p is done.
• Processor Pi contains the elements of the matrix with indices [i, 0], [i, 1], ..., [i, n-1].
• In the transpose A^T, P0 will have element [i, 0], P1 element [i, 1], and so on.
• Initially processor Pi has element [i, j]; after the transpose it moves to Pj.
• Fig. Q.7.1 shows the example of a 4 x 4 matrix mapped onto four processes using one-dimensional row-wise partitioning.
Fig. Q.7.1 : All-to-all personalized communication in transposing a 4 x 4 matrix using four processes

Q.8 Explain all-to-all personalized communication on a ring.
Ans. : • As shown in Fig. Q.8.1, consider all-to-all personalized communication on a six-node ring.
Fig. Q.8.1 : All-to-all personalized communication on a six-node ring
• The pieces of data are identified by labels {x, y}, where x is the label of the node that originally owns the message and y is the label of the final destination of the message. For example, in {0, 1}, 0 is the source node and 1 is the destination node.
• The label ({x1, y1}, {x2, y2}, ..., {xn, yn}) denotes a message formed by the concatenation of n individual messages.
• Initially each node sends all (p - 1) pieces of its data as one consolidated message to one of its neighbours. For example, node 0 sends the consolidated message ({0, 5}, {0, 4}, {0, 3}, {0, 2}, {0, 1}) to node 5.
• In each subsequent step a node extracts the piece destined for itself and forwards the rest, so after p - 1 steps every node has received the piece addressed to it from every other node.

Q.9 Explain all-to-all personalized communication on a mesh.
Ans. : • For all-to-all personalized communication on a √p x √p mesh, at each node the p messages are first grouped according to the columns of their destination nodes.
• As shown in Fig. Q.9.1 for a 3 x 3 mesh, each node initially has nine m-word messages, one for each node.
Fig. Q.9.1 : All-to-all personalized communication on a 3 x 3 mesh (distribution of messages at the beginning of the first and second phases)
• For each node, three groups of three messages are formed. The first group contains the messages for destination nodes labelled 0, 3 and 6; the second group contains the messages for nodes 1, 4 and 7; and the last group has messages for nodes labelled 2, 5 and 8.
• After this grouping, in the first phase an all-to-all personalized communication is performed in each row, with each node exchanging message groups with the other nodes of its row.
• After the first phase, the messages at each node are regrouped, this time considering the rows of destination nodes.
• In the second phase a similar all-to-all personalized communication is carried out within each column.
• After completion of the second phase, node i holds the messages ({0, i}, {1, i}, ..., {8, i}), where 0 <= i <= 8. So each node has received one message from every other node.
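The personalized exchange of Q.7 - Q.9 is exposed in MPI as a single collective, MPI_Alltoall, which sends a distinct block from every process to every other process. The sketch below (not from the book; the block size and names are assumptions) distributes one row of a p x p matrix per process and uses MPI_Alltoall so that afterwards process j holds column j, i.e. one row of the transpose.

/* Transpose of a p x p matrix distributed one row per process, done
 * with a single all-to-all personalized exchange (MPI_Alltoall).
 * Element [i][j] starts on process i and ends on process j. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double row[64], tcol[64];          /* assumes p <= 64 for the sketch */

    /* My row of A : A[rank][j] = 10*rank + j */
    for (int j = 0; j < p; j++)
        row[j] = 10.0 * rank + j;

    /* Send element j of my row to process j; the element received from
     * process i lands in slot i. Afterwards tcol[i] = A[i][rank], i.e.
     * this process owns row 'rank' of the transpose. */
    MPI_Alltoall(row, 1, MPI_DOUBLE, tcol, 1, MPI_DOUBLE, MPI_COMM_WORLD);

    printf("rank %d: A^T[%d][0] = %.0f\n", rank, rank, tcol[0]);
    MPI_Finalize();
    return 0;
}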
Q.10 Explain all-to-all personalized communication on a hypercube.  [SPPU : Dec.]
Ans. : • All-to-all personalized communication on a p-node hypercube can be done by extending the two-dimensional mesh algorithm to log p dimensions. As shown in Fig. Q.10.1, consider the example of a 3D hypercube.
• Each node initially has p packets of size m each.
• The communication takes place in log p steps.
• The messages are rearranged locally before each step.
• In each step, the data is exchanged by pairs of nodes.
• We know that in a p-node hypercube, two subcubes of p/2 nodes each are connected by p/2 links in the same dimension.
• At any stage, out of the p packets present with a node, the p/2 packets whose destinations lie in the other subcube are sent across the corresponding link as one cluster of messages.
• For example, in the first phase node 0 sends the cluster ({0, 4}, {0, 5}, {0, 6}, {0, 7}) to node 4 and receives the corresponding cluster destined for the lower subcube in return.
Fig. Q.10.1 : An all-to-all personalized communication algorithm on a three-dimensional hypercube - (a) initial distribution of messages, (b) distribution before the second step, (c) distribution before the third step, (d) final distribution of messages

Q.11 Explain optimal all-to-all personalized communication on a hypercube.
Ans. : An optimal algorithm :
• On a hypercube, an all-to-all personalized communication becomes optimal if every pair of nodes communicates directly with each other.
• Then each node performs p - 1 communications, exchanging m words of data with a different node in every step.
• To avoid congestion, each node must carefully choose its communication partner in each step.
• As shown in Fig. Q.11.1 and in Algorithm Q.11.1, in the j-th communication step node i exchanges data with node (i XOR j). For example, in step 1 : i = 000 (node 0), j = 001 (as the communication step is 1), so 000 XOR 001 = 001; thus node 0 exchanges data with node 1.
• Similarly, in step 5 : for i = 000 and j = 101, 000 XOR 101 = 101, so node 0 exchanges data with node 5.
• With this pairing, and because the links of a hypercube are bidirectional, in every step each link carries only one message in each direction.
• A message travelling from node i to node j must traverse links corresponding to the non-zero bits of (i XOR j); the number of such bits is called the Hamming distance between i and j.
• The path of a message is obtained by sorting the dimensions in which it travels in ascending order : the first chosen link corresponds to the least significant non-zero bit of (i XOR j), the next link to the next non-zero bit, and so on.
• This routing scheme is known as E-cube routing.
Fig. Q.11.1 : Seven steps in all-to-all personalized communication on an eight-node hypercube

1.  procedure ALL_TO_ALL_PERSONAL (d, my_id)
2.  begin
3.      for i := 1 to 2^d - 1 do
4.      begin
5.          partner := my_id XOR i;
6.          send M_{my_id, partner} to partner;
7.          receive M_{partner, my_id} from partner;
8.      endfor;
9.  end ALL_TO_ALL_PERSONAL
Algorithm Q.11.1 : A procedure to perform all-to-all personalized communication on a d-dimensional hypercube

• As per the E-cube routing strategy, there is no contention with any other message travelling in the same direction along the link between nodes i and j, so the communication time required for each message transfer is t_s + t_w m.
• The total communication time for the entire operation is therefore
  T = (t_s + t_w m)(p - 1).
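The pairwise-exchange idea of Algorithm Q.11.1 translates directly to MPI : in step j every rank exchanges a personalized block with rank (rank XOR j). The sketch below is illustrative (not from the book) and assumes the number of processes is a power of two and at most 64.

/* Pairwise-exchange all-to-all personalized communication :
 * in step j, rank i exchanges its block for partner (i XOR j). */
#include <mpi.h>
#include <stdio.h>

#define M 4                       /* words per personalized message (assumed) */

int main(int argc, char **argv)
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double sendblk[64][M], recvblk[64][M];
    for (int dst = 0; dst < p; dst++)
        for (int k = 0; k < M; k++)
            sendblk[dst][k] = 100.0 * rank + dst;   /* message M_{rank,dst} */

    /* Keep my own block. */
    for (int k = 0; k < M; k++)
        recvblk[rank][k] = sendblk[rank][k];

    for (int j = 1; j < p; j++) {
        int partner = rank ^ j;                     /* node (my_id XOR j)   */
        MPI_Sendrecv(sendblk[partner], M, MPI_DOUBLE, partner, j,
                     recvblk[partner], M, MPI_DOUBLE, partner, j,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    printf("rank %d received block from rank %d: %.0f\n",
           rank, rank ^ 1, recvblk[rank ^ 1][0]);
    MPI_Finalize();
    return 0;
}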
3.5 : Circular Shift

Q.12 Explain the circular shift operation on a mesh.  [SPPU : May, Marks 8; Dec., Marks 6]
Ans. : • Circular shift can be applied in some matrix computations and in string and image pattern matching.
• In a circular q-shift, node i sends data to node (i + q) mod p in a group of p nodes, where 0 < q < p.

Cannon's algorithm :
• Fig. Q.13.1 shows the implementation of Cannon's algorithm on square submatrices, including the submatrix locations after the first shift : in each step the blocks of A are shifted along the rows and the blocks of B along the columns, and every process multiplies the pair of blocks it currently holds and accumulates the result.
Fig. Q.13.1 : Cannon's algorithm implementation - submatrix locations after the first shift

The DNS algorithm :
• This algorithm is known as the DNS algorithm after its originators, the mathematicians Dekel, Nassimi and Sahni.
• The matrix multiplication algorithms presented so far use a block 2D partitioning of the input and the output matrices and use a maximum of n² processes for n x n matrices. As a result these algorithms have a parallel run time of Ω(n), because there are Θ(n³) operations in the serial algorithm.
• Since the matrix multiplication algorithm performs n³ scalar multiplications, each of the n³ processes is assigned a single scalar multiplication.
• The processes are labelled according to their location in a three-dimensional array, and the multiplication A[i, k] x B[k, j] is assigned to process P[i, j, k] (0 <= i, j, k < n).
• Fig. Q.13.2 is a visual representation of the DNS algorithm; it depicts the communication steps while multiplying 4 x 4 matrices A and B on 64 processes.
• The shaded processes in Fig. Q.13.2 (a) store elements of the first row of A, and the shaded processes in Fig. Q.13.2 (b) store elements of the first column of B.
• The process arrangement can be visualized as n planes of n x n processes each. Each plane corresponds to a different value of k.
• The matrices are distributed among the n² processes of the plane corresponding to k = 0 at the base of the three-dimensional process array. Process P[i, j, 0] initially owns A[i, j] and B[i, j].
• The DNS algorithm has three main communication steps :
  a. Moving the columns of A and the rows of B to their respective planes,
  b. Performing one-to-all broadcast along the j axis for A and along the i axis for B, and
  c. All-to-one reduction along the k axis.
Fig. Q.13.2 : The communication steps in the DNS algorithm while multiplying 4 x 4 matrices A and B on 64 processes
• All these operations are performed within groups of n processes and take Θ(log n) time each. Thus, the parallel run time for multiplying n x n matrices using the DNS algorithm on n³ processes is Θ(log n).

Block matrix operations :
• A concept that is useful in matrix multiplication and a variety of other matrix algorithms is that of block matrix operations.
• For example, an n x n matrix A can be regarded as a q x q array of blocks A[i, j] (0 <= i, j < q) such that each block is an (n/q) x (n/q) submatrix. The matrix multiplication algorithm can then be rewritten to operate on blocks instead of scalars, as follows.

1.  procedure BLOCK_MAT_MULT (A, B, C)
2.  begin
3.      for i := 0 to q - 1 do
4.          for j := 0 to q - 1 do
5.          begin
6.              Initialize all elements of C[i, j] to zero;
7.              for k := 0 to q - 1 do
8.                  C[i, j] := C[i, j] + A[i, k] x B[k, j];
9.          endfor;
10. end BLOCK_MAT_MULT

Chapter - 5 : CUDA Architecture

5.4 : Introduction to CUDA C
To begin programming (writing code) in CUDA, the required components are :
1. A CUDA-enabled graphics processor : some of the NVIDIA GPU product lines are GeForce, Quadro, Tesla, NVS and Tegra.
2. An NVIDIA device driver that allows your programs to communicate with the CUDA-enabled hardware; NVIDIA provides this system software.
3. A CUDA development toolkit : if you have a CUDA-enabled GPU and the NVIDIA device driver, install the CUDA Toolkit for the operating system of your choice to complete the installation; you are then ready to compile and run CUDA C code.
4. A standard C compiler : the CUDA Toolkit includes a compiler for GPU code, while for the CPU code a standard host compiler appropriate to the operating system must be installed separately.
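Once the GPU, driver and toolkit listed above are installed, a small device-query program is a convenient way to confirm that CUDA code can actually run. The sketch below is illustrative (not from the book) and uses only the standard runtime calls cudaGetDeviceCount and cudaGetDeviceProperties.

/* Minimal CUDA setup check : list the CUDA-capable devices visible to
 * the runtime. Compile with :  nvcc device_query.cu -o device_query   */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("CUDA not available: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d, %d SMs\n",
               dev, prop.name, prop.major, prop.minor,
               prop.multiProcessorCount);
    }
    return 0;
}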
Q.9 Explain how CUDA provides optimized performance through cooperation between CPU and GPU.
OR Explain the task execution model of CUDA with a diagram.
OR Explain the CUDA task execution model.
OR How can parallelism be achieved through a GPU ?
Ans. : Anatomy of a CUDA C program :
• In CUDA, the work is divided between the CPU and the GPU :
  1. Serial code executes in a Host (CPU) thread.
  2. Parallel code executes in many concurrent Device (GPU) threads across multiple parallel processing elements called Streaming Multiprocessors (SMs).
• Fig. Q.9.1 shows a graphical illustration of the serial and parallel code execution on the Host (CPU) and the Device (GPU) respectively.
Fig. Q.9.1 : Graphical illustration of the serial and parallel code execution
• At this point it is important to understand that in a parallel execution environment there can be a minimum of one CPU and one or more GPUs existing in the system, and they are marshalled by the CUDA architecture. Refer Fig. Q.9.2.
Fig. Q.9.2 : Graphical representation of multiple blocks (from a queue of waiting blocks) running on each streaming multiprocessor (SM)
• To achieve parallelism using CUDA, the serial as well as the parallel code of a C program executes in the following sequence :
  1. The program starts executing on the Host (CPU).
  2. A CPU thread is initialized to execute the serial code, and GPU threads are initialized for the parallel code.
  3. The GPU thread copies the data needed for parallel processing from main memory to GPU memory.
  4. The CPU initiates the GPU compute kernel and hands over control to the GPU.
  5. The CPU continues with the serial code.
  6. The GPU's CUDA cores execute the kernel in parallel.
  7. The GPU thread copies the resulting data from GPU memory back to main memory.

5.5 : Write and Launch CUDA C Kernels

Q.11 Write and explain a simple CUDA C kernel.
OR Write a short note on CUDA C kernel functions.
OR Describe how nvcc understands CUDA C functions.
OR Write a short note on CUDA kernels. Also explain the kernel call syntax.
Ans. : • Let's have a look at our first CUDA C code below :

#include <stdio.h>

__global__ void kernel(void) {
}

int main(void) {
    kernel<<<1,1>>>();
    printf("Hello, World!\n");
    return 0;
}

• This program has two important additions to our original "Hello, World!" program :
1. An empty function named kernel(), qualified with __global__. The __global__ qualifier alerts the compiler that the function should be compiled to run on the device (i.e. the GPU) instead of the host; the nvcc compiler hands the function kernel() to the compiler that builds device code and hands the rest of the program (for example main()) to the host compiler, as in the previous example.
2. A call to the empty function, decorated with <<<1,1>>>. The angle-bracket syntax denotes arguments we pass to the runtime system; these are not arguments to the device code but parameters that influence how the kernel will be launched at runtime.
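As a small extension of the kernel above (not from the book), the launch parameters inside <<< >>> can be used to start more than one copy of the kernel; on GPUs of compute capability 2.0 or later the device-side printf makes the effect visible. The counts 2 and 4 used here are arbitrary illustrative values.

/* Launching several copies of a trivial kernel. Each GPU thread prints
 * its block and thread index. Compile with :  nvcc hello_threads.cu    */
#include <stdio.h>

__global__ void hello(void)
{
    /* blockIdx.x and threadIdx.x identify this particular GPU thread */
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    hello<<<2, 4>>>();           /* 2 blocks of 4 threads = 8 GPU threads    */
    cudaDeviceSynchronize();     /* wait so the device printf output appears */
    return 0;
}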
5.6 : Managing GPU Memory

Q.12 Write a short note on managing GPU memory.  [SPPU : May-19, Marks 8]
OR Write a short note on CUDA memory management functions.
OR Explain the CUDA mechanism to move data from CPU to GPU and vice versa.
OR Explain memory handling in CUDA with memory management functions.
OR Explain memory management in CUDA.
Ans. : • Parameters have to be passed from CPU code to the kernel code that runs on the GPU.
• To understand parameter passing to the device (i.e. GPU), let us make the following enhancement to our "Hello, World!" code : we perform a simple 2 + 7 addition on the device (i.e. GPU).

#include "book.h"

__global__ void add(int a, int b, int *c) {
    *c = a + b;
}

int main(void) {
    int c;
    int *dev_c;
    HANDLE_ERROR(cudaMalloc((void**)&dev_c, sizeof(int)));

    add<<<1,1>>>(2, 7, dev_c);

    HANDLE_ERROR(cudaMemcpy(&c, dev_c, sizeof(int),
                            cudaMemcpyDeviceToHost));
    printf("2 + 7 = %d\n", c);
    cudaFree(dev_c);
    return 0;
}

• You will notice several new lines here, but these changes introduce only two concepts :
1. We can pass parameters to a kernel as we would with any C function. In this code, add() is a kernel function. There is nothing special about passing parameters to a kernel; the angle-bracket syntax notwithstanding, a kernel call looks and acts exactly like any function call in standard C. The runtime system takes care of any complexity introduced by the fact that these parameters need to get from the host to the device.
2. Using the CUDA C function cudaMalloc() we allocate memory on the device. The CUDA runtime allocates the memory on the device as per the parameter values passed to it. The first argument is a pointer to the pointer that will hold the address of the newly allocated memory, and the second parameter is the size of the allocation we want to make (here, space to store the result value of the kernel function add()).

Q.13 Explain the cudaMalloc function with its limitations.
OR What are the limitations of pointers allocated with cudaMalloc ?
Ans. : • cudaMalloc() is similar in behaviour to malloc(), but it allocates memory on the device. There are, however, restrictions on the use of device pointers allocated with cudaMalloc() :
1. One can pass pointers allocated with cudaMalloc() to functions that execute on the device (i.e. GPU).
2. One can use pointers allocated with cudaMalloc() to read or write memory from code that executes on the device (i.e. GPU).
3. One can pass pointers allocated with cudaMalloc() to functions that execute on the host (i.e. CPU).
4. One cannot use pointers allocated with cudaMalloc() to read or write memory from code that executes on the host (i.e. CPU).
• To free memory allocated with cudaMalloc(), we need to use cudaFree().

Q.14 Explain the cudaMemcpy function.
Ans. : • Memory on a device is accessed from host code through calls to cudaMemcpy(). These calls behave exactly like standard C memcpy(), with an additional parameter specifying the direction of the copy :
1. The direction cudaMemcpyDeviceToHost instructs the runtime that the source is a device pointer (dev_c) and the destination is a host pointer (&c).
2. cudaMemcpyHostToDevice tells the opposite : the source data is on the host and the destination is an address on the device.
3. We can even specify that both pointers are on the device by using cudaMemcpyDeviceToDevice.
4. If the source and destination pointers are both on the host, we simply use standard C's memcpy() routine to copy between them.
Fig. Q.14.1 : Graphical representation of the cudaMemcpy directions
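The HANDLE_ERROR() wrapper used in the add() listing above comes from the book.h helper header that the listing includes; it is not part of the CUDA runtime itself. A minimal sketch of such an error-checking macro, built only on the cudaError_t return codes and the standard runtime call cudaGetErrorString(), might look like this (illustrative, not the book's exact implementation) :

/* Minimal error-checking helper in the spirit of book.h's HANDLE_ERROR.
 * Every CUDA runtime call returns a cudaError_t; this macro aborts with
 * a readable message if the call did not return cudaSuccess.           */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

static void HandleError(cudaError_t err, const char *file, int line)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s in %s at line %d\n",
                cudaGetErrorString(err), file, line);
        exit(EXIT_FAILURE);
    }
}

#define HANDLE_ERROR(err) (HandleError((err), __FILE__, __LINE__))

/* Usage :
 *   int *dev_c;
 *   HANDLE_ERROR(cudaMalloc((void**)&dev_c, sizeof(int)));
 */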
Seo “A Gale or Emini Sudens CUDA Arhitey, ocr compint consequent SUPP rt threads i Very diffeny, gh Perfor cPUs and GPUE we at neo CUDA HA EDR kernel functions on Witte ® egg cds CUDA ere «Parallel portion ofthe , : anucoreanl ‘of CUDA threads efficiently Kee ed nt cs 1 ts _ : 1s of parallel blocks represent cifrent level «Threads and blocks &7 cpt a7 ee Cone communion not, Mot ‘ans: CUDA communication and synchronization £ © CUDA is ideal for an parallel problem, where little or no Se nd ommaatn wh expt primis angen dup sues cane ommuncaten i ony supported by invoking lle ere in er DLE enory ca a be peed in resoed wy ough fomie spartan fo rom gal menery s cadaDevicSyncronie; ot cudaDevieRtet); ae the two CUDA C functors wed Yo cuorize device memory contents 218 Explain: how CUDA ecto purl cade SMe (Smet ‘Ans, : How CUDA executes parallel code © CUDA splits problems into grids of blocks, each multiple threads. onaining The blocks may run in any order. =e <= "Ga Ener ae } Lu gh enformance Compating CUDA Auhitecare «Only 0 subset ofthe blocks will ever execute at any one point In time. +A block must execute from start to completion and may be run ‘on one of N SMs (Symmetrical Multiprocessors) « Blocks are allocated from the grid of blocks to any $M that has free slots. « Initially this is done on a round-robin basis #0 equal distribution of blocks «For most kernels, the number of blocks needs to be in the order of eight or more times the number of physical SMs on the GPU Q.49 What Is CUDA wrap ? (OR Wirlte & phort note on CUDA wrap. OR What Js a Wrap ? Explain GPU utilization with respect to wrap tlze In CUDA. OR Write © short note on each In the context of parallel ‘programming on CUDA C : Wrap. ‘Ans. : CUDA wrap : + Symmetrical multiprocessors in CUDA, do rot give the threads directly to the execution resources ich SM gets an ‘ona Tay] [a] [nm] [We] [Win] [are Ta N werd |LNee | LLIN] Lee Fig. ‘Instead it will try to divide the threads in the block again into Warps(32 threads each). Hale for Englcering Sade co ‘7 CUDA Arc prema OO peformsne COMPA ee apne oye block exit SIMD execution, ‘The Warps ‘ ie Dats} resto any tread i 8 AEP, SM si, ee ays hae seme wrk 1 do, “te rer of 4K active 06845, On Fp, 351596 treads i total 24 This is usually €MOURH to coe (OR Write sort ote ot paz Tad i aflame lt of pall Pf7ID pan tame the concept of thread in. mullic programming Thread is actualy alight weight process time «Malt aching done by evecuting multiple threads ata time 4 Thus threads bring concurency inthe application + For example consider the following function wold ealeulateCubetvd) { sae for (it i=0<5123+-+) { ali bp) y that handles one task 2 ) 4 This sa simple C function for calculating cube of th numbers in array bf ]. On CPU, a single for loop will ae achieve the desired result. a =—=--—«\KqKoo—— <= “Galo neg ane \ @y ‘ompating 5 CUDA Architecture gh Po ‘Instead, in CUDA 512 threads can be launched, each of one calculating cube of an element «Calculation is possible because there is no dependency between the iterations of the loop ‘stn CUDA, this mechanism is programmatically easy. Simply by transforming the calculateCube(void) as a Kerel function, the ‘cube calculation executes only on GPU by launching. parallel thveads. «In CUDA, the thread information is provided in a builtin CUDA taable threadIdx. 
Q.20 Write a short note on parallel threads in CUDA.
OR Explain the concept of a thread in multicore programming.
Ans. : • A thread is a lightweight process that handles one task at a time.
• Multitasking is done by executing multiple threads at a time; thus threads bring concurrency into an application.
• For example, consider the following function :

void calculateCube(void)
{
    for (int i = 0; i < 512; i++) {
        a[i] = b[i] * b[i] * b[i];
    }
}

• This is a simple C function for calculating the cube of the numbers in array b[]. On a CPU, a single for loop achieves the desired result.
• Instead, in CUDA 512 threads can be launched, each one calculating the cube of one element.
• This is possible because there is no dependency between the iterations of the loop.
• In CUDA this mechanism is programmatically easy : simply by transforming calculateCube(void) into a kernel function, the cube calculation executes on the GPU by launching parallel threads.
• In CUDA, the thread information is provided in a built-in CUDA variable threadIdx. Instead of referring to the structure each time, we create a thread variable named tid and assign it the value of the thread ID from the thread structure :

__global__ void calculateCube(int *const a, const int *const b)
{
    const unsigned int tid = threadIdx.x;
    a[tid] = b[tid] * b[tid] * b[tid];
}

Q.21 Differentiate between CPU threads and CUDA threads.
Ans. : Some important points to note about threads :
• A thread is the smallest unit in parallel programming, so also in CUDA.
• A thread ID is unique within a block.
• All threads execute the same sequential program; the threads execute in parallel.
• Difference between CUDA threads and traditional CPU threads :

Sr. No. | CUDA Threads | CPU Threads
1. | CUDA threads are extremely lightweight and have less creation overhead than CPU threads. | CPU threads are also lightweight processes but have more creation overhead than CUDA threads.
2. | Thread switching is instant. | Thread switching is not as efficient.
3. | CUDA makes use of a very large number of threads. | A multi-core CPU makes use of only a few threads.

Q.22 Explain how the addition of two vectors can be performed using CPU threads (cores).  [SPPU : Dec.-22]
Ans. : • On a dual-core CPU, this addition can be split so that each core adds every alternate element : one thread starts at index 0 and the other at index 1, and both step by 2.

/* CPU core 1 */
void add(int *a, int *b, int *c) {
    int tid = 0;                 /* this is CPU core / thread 0 */
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += 2;
    }
}

/* CPU core 2 */
void add(int *a, int *b, int *c) {
    int tid = 1;                 /* this is CPU core / thread 1 */
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += 2;
    }
}

Q.23 Explain the structure of CUDA blocks and threads with a suitable example.
OR Explain in CUDA : i) CUDA threads  ii) CUDA blocks.  [SPPU : Dec.-22, Marks 5]
Ans. : • In CUDA, threads are organized into blocks.
• There is a limit to the number of threads per block; on current GPUs, a thread block may contain up to 1024 threads.
• A thread block executes on a single Streaming Multiprocessor (SM).
• Threads within a block can cooperate and provide light-weight synchronization and data exchange.

Q.24 Explain blocks in a grid and the kernel launch syntax kernel<<<N, 1>>>().
Ans. : • With kernel<<<2,1>>>() we are creating two parallel blocks and the runtime runs them in parallel; each of the two copies of the kernel is a block. With kernel<<<256,1>>>(), 256 blocks are launched.
• CUDA C allows a group of blocks to be defined in two dimensions, which is convenient for problems with two-dimensional domains such as matrix mathematics or image processing.
• The collection of parallel blocks is referred to as a grid; the grid size is defined by the number of blocks in the grid.
• Grid : a collection of thread blocks. The thread blocks of a grid execute across multiple SMs.
• Thread blocks in a grid do not synchronize with each other.
• Communication between blocks is expensive.

Q.25 Write a short note on each in the context of parallel programming in CUDA C : i) Grid  ii) Warp.
OR Write a short note on : CUDA Grid.  [SPPU : Dec.-22, Marks 5]
Ans. : Grids and blocks :
• When we call a kernel using the <<< >>> instruction, we automatically define dim3-type variables giving the number of blocks per grid and the number of threads per block.
• In fact, grids and blocks are 3D arrays of blocks and threads respectively. This is evident when we define them before calling a kernel, with something like this :

dim3 blocksPerGrid(512, 1, 1);
dim3 threadsPerBlock(512, 1, 1);
kernel<<<blocksPerGrid, threadsPerBlock>>>();

• Let's have a look at the typical index-computation code :

int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;

• As you can see, the code is similar for both dimensions. In CUDA, blockIdx, blockDim and threadIdx are built-in variables with members x, y and z. They are indexed like normal vectors in C++, i.e. from 0 to the maximum number minus 1.
• For instance, if we have a grid dimension of blocksPerGrid = (512, 1, 1), blockIdx.x will range between 0 and 511.
• The total number of threads in a single block cannot exceed 1024.
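Putting the grid/block/thread indexing above together, a complete GPU vector addition looks like the following sketch (illustrative, not the book's listing; the value of N and the block size of 256 are assumptions). It combines blocks and threads through the index formula tid = blockIdx.x * blockDim.x + threadIdx.x.

/* Vector addition on the GPU using multiple blocks of multiple threads. */
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 16)                    /* number of elements (assumed) */

__global__ void add(const int *a, const int *b, int *c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   /* global index */
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main(void)
{
    int *a, *b, *c;                    /* host arrays   */
    int *dev_a, *dev_b, *dev_c;        /* device arrays */
    size_t bytes = N * sizeof(int);

    a = (int*)malloc(bytes); b = (int*)malloc(bytes); c = (int*)malloc(bytes);
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    cudaMalloc((void**)&dev_a, bytes);
    cudaMalloc((void**)&dev_b, bytes);
    cudaMalloc((void**)&dev_c, bytes);

    cudaMemcpy(dev_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                           /* threads per block       */
    int blocks = (N + threads - 1) / threads;    /* enough blocks to cover N */
    add<<<blocks, threads>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[100] = %d\n", c[100]);             /* expect 300 */

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    free(a); free(b); free(c);
    return 0;
}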
For instance, if we have a grid dimension of blocksPerGrid = (512 1, 0) Dlockldx.x will range between 0 and S11 4 The total amount of threads in a single block cannot exceed 1024 0.26 Write CUDA C code for summing two vectors on GPU using ‘iterate threads, ‘Ans. : Full Code for matrix multiplication ovary ‘ifadet _DEV_ARRAY H_ ‘#define DEV _ARRAY H_ CUDA Architecture ‘include include ‘include template clase dev_array « 1) ple functions public ‘explicit dev_aray() star_(0), ond_(0) o J eonstructor ‘explicit dev_array(size_t za) Gee for Enginering Students << 1p re memory on te device void fret) l it (sart_'* 0) ( cudeFre(star.) start_= end 0) , Tsar Tends enait END... ee —— || High Performance | Computing Applications [a 6.1 : Importance of Sorting in Computers {a1 What le the importance of voting algorithm ? (OR Write a short note onimportance of vortingalgorthms. OR What are_the types of sorting algorithm with reference 10 emory unsge ? Explln with example. ‘Ans. :« Sorting is one of the most common operations performed by a computer. « Sorted data are easier to manipulate than randomly-ordered data, many algorithms require sorted data «+ Sorting is of additional importance to parallel computing because of its close relation to the task of routing data among processes, whichis an essential part of many parallel algorithms + Many parallel sorting algorithms have been investigated for a varity of parallel computer architectures. « Sorting algorithms are categorized as internal or extemal In internal sorting, the number of elements to be sorted is small enough to fit into the process's main memory. + External sorting algonthms use auxiliary storage (ouch 0s tapes and hard disks) for sorting because the number of clement to be sorted i too large t fit into memory. + Sorting algorithms can be categorized as comparison-based and rnon-comparison-based. on yr High Peon, Gompating Ain igh Peormance = pronase — 4 Computing pation’ Ere nthe gust for at sing meteds nants of have been designed that sort n elements in time setprancly smaller than €n log n) a rig orks ae Deed on « imparvn network Tae ip which many comparson operators ae performed simultaneously. Columns of comparators a Ee) — 0 a T Ay ~~ 9 | Fig. QA A typleal sorting network | se depth of a network isthe number of cohumas i contains se the speed. of a comparator ie technology-dependent mms post ommag ag ear en | ainatant, the speed ofthe network is proportional to its depth. sre ve + Atypial sorting network Every sorting network s made up of « Papeete ao ‘columns and each column contains a number of 7 ate bidirectional, then stors connected i a. the commanaton coo «compart cpeuton | gy te in paral 5 Explain the concept of comparators on the sorting networks. dns. + « The key component ofthese networks is a comparator +A comparator is a device with two inputs x and y and fo (tp). (04 Wite sort note on sorting networks. | OR Explan sorting network with suitable dlgram, ea | outputs x’ and y’ ‘M, Rerts 7] | ‘Foran increasing comparaton@). x = ints J and = Taupe | =max{x yh 7 ens > es “Sanat Ferns es : }— em vee , (An eran ompartor : [ee Omar, at Ye min) 9—O— yeni) oH (A ecraag anpatr Fla. 08:1 A schematic representation of comparator *Fora TAL Seceastg compantor®) x = may and y' = mins) Schematic representation of comparators : a) An Deeasing comparator and (b) A decreasing comparator. as semi = 8 meertn a babe sr, ely “ cxample A (SU: hy 19, ars Write short note on bubble sort with odd - even ‘transportation. ‘Ans. 
Q.6 Write a short note on bubble sort with odd-even transposition.  [SPPU : May-19]
Ans. : • The odd-even transposition algorithm sorts n elements in n phases (n is even), each of which requires n/2 compare-exchange operations.
• Let <a1, a2, ..., an> be the sequence to be sorted.
• The algorithm alternates between two phases, called the odd and even phases.
• During the odd phase, the pairs (a1, a2), (a3, a4), ..., (an-1, an) are compare-exchanged.
• During the even phase, the pairs (a2, a3), (a4, a5), ..., (an-2, an-1) are compare-exchanged.

  Unsorted        :  3  2  3  8  5  6  4  1
  Phase 1 (odd)   :  2  3  3  8  5  6  1  4
  Phase 2 (even)  :  2  3  3  5  8  1  6  4
  Phase 3 (odd)   :  2  3  3  5  1  8  4  6
  Phase 4 (even)  :  2  3  3  1  5  4  8  6
  Phase 5 (odd)   :  2  3  1  3  4  5  6  8
  Phase 6 (even)  :  2  1  3  3  4  5  6  8
  Phase 7 (odd)   :  1  2  3  3  4  5  6  8
  Phase 8 (even)  :  1  2  3  3  4  5  6  8   (sorted)
Fig. Q.6.1 : Sorting n = 8 elements using odd-even transposition

• After n phases of odd-even exchanges the sequence is sorted.
• Each phase of the algorithm (either odd or even) requires Θ(n) comparisons, and there are a total of n phases; thus the sequential complexity is Θ(n²).

Q.7 Explain the parallel formulation of odd-even transposition sort.
Ans. : • It is easy to parallelize odd-even transposition sort.
• During each phase of the algorithm, the compare-exchange operations on pairs of elements are performed simultaneously.
• Consider the one-element-per-process case.
• Let n be the number of processes (also the number of elements to be sorted).
• Assume that the processes are arranged in a one-dimensional array and that element ai initially resides on process Pi for i = 1, 2, ..., n.
• During the odd phase, each process that has an odd label compare-exchanges its element with the element residing on its right neighbour.
• Similarly, during the even phase, each process with an even label compare-exchanges its element with the element of its right neighbour.
• With one element per process the parallel run time is Θ(n), which is not cost-optimal, since the best sequential sorting algorithms run in Θ(n log n) time.
• To obtain a cost-optimal parallel formulation we use fewer processes. Let p be the number of processes, where p < n.
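The sequential form of the algorithm is short enough to state in full. The sketch below (illustrative, not the book's listing) performs the n alternating phases on an array; the inner loop of each phase is exactly the part that the parallel formulation above executes simultaneously on different processes.

/* Odd-even transposition sort : n phases, each doing ~n/2 independent
 * compare-exchange operations. The inner loops are the portions that a
 * parallel formulation can execute simultaneously.                     */
#include <stdio.h>

static void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }

static void odd_even_sort(int a[], int n)
{
    for (int phase = 1; phase <= n; phase++) {
        if (phase % 2 == 1) {                    /* odd phase  */
            for (int i = 0; i + 1 < n; i += 2)   /* pairs (a1,a2),(a3,a4),... */
                if (a[i] > a[i + 1]) swap(&a[i], &a[i + 1]);
        } else {                                 /* even phase */
            for (int i = 1; i + 1 < n; i += 2)   /* pairs (a2,a3),(a4,a5),... */
                if (a[i] > a[i + 1]) swap(&a[i], &a[i + 1]);
        }
    }
}

int main(void)
{
    int a[] = { 3, 2, 3, 8, 5, 6, 4, 1 };
    int n = 8;
    odd_even_sort(a, n);
    for (int i = 0; i < n; i++) printf("%d ", a[i]);   /* 1 2 3 3 4 5 6 8 */
    printf("\n");
    return 0;
}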
