Fig. 13.4 Cannon's algorithm implementation : (a) Submatrix locations after first shift
b) The DNS algorithm
• This algorithm is known as the DNS algorithm after its originators, the mathematicians Dekel, Nassimi and Sahni.
• The matrix multiplication algorithms presented so far use block 2-D partitioning of the input and the output matrices and use a maximum of n² processes for n × n matrices. As a result, these algorithms have a parallel run time of Ω(n) because there are Θ(n³) operations in the serial algorithm.
• Since the matrix multiplication algorithm performs n³ scalar multiplications, each of the n³ processes is assigned a single scalar multiplication.
• The processes are labeled according to their location in the array, and the multiplication A[i, k] × B[k, j] is assigned to process P_{i,j,k} (0 ≤ i, j, k < n).
• Below is a visual representation of the DNS algorithm, depicting the communication steps while multiplying 4 × 4 matrices A and B on 64 processes.
• The shaded processes in Fig. Q.13.2 (a) store elements of the first row of A and the shaded processes in Fig. Q.13.2 (b) store elements of the first column of B.
• The process arrangement can be visualized as n planes of n × n processes each. (See Fig. Q.13.2.)
• Each plane corresponds to a different value of k.
• The matrices are distributed among the n² processes of the plane corresponding to k = 0 at the base of the three-dimensional process array. Process P_{i,j,0} initially owns A[i, j] and B[i, j].
• The DNS algorithm has three main communication steps :
a. Moving the columns of A and the rows of B to their respective planes,
b. Performing one-to-all broadcast along the j axis for A and along the i axis for B, and
c. All-to-one reduction along the k axis.
Fig. Q.13.2 Communication steps in the DNS algorithm
• All these operations are performed within groups of n processes.
• As shown in Fig. Q.13.2, the parallel run time of multiplying n × n matrices using the DNS algorithm on n³ processes is Θ(log n).
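For illustration, here is a small serial C sketch (not a parallel implementation; names are illustrative) that emulates the DNS assignment : each logical process P_{i,j,k} holds the single product A[i, k] × B[k, j], and the all-to-one reduction along the k axis accumulates C[i, j].

#include <stdio.h>

#define N 4

int main(void)
{
    int A[N][N], B[N][N], C[N][N] = {{0}};
    int partial[N][N][N];  /* partial[i][j][k] plays the role of process P_{i,j,k} */

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = i + j;
            B[i][j] = (i == j);  /* identity, so C should come out equal to A */
        }

    /* After the alignment and broadcast steps, P_{i,j,k} owns A[i][k] and
       B[k][j] and performs its single scalar multiplication. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                partial[i][j][k] = A[i][k] * B[k][j];

    /* All-to-one reduction along the k axis accumulates C[i][j] at P_{i,j,0}. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += partial[i][j][k];

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%3d ", C[i][j]);
        printf("\n");
    }
    return 0;
}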
c) Block matrix operations
• A concept that is useful in matrix multiplication, as well as in a variety of other matrix algorithms, is that of block matrix operations.
• For example, an n × n matrix A can be regarded as a q × q array of blocks A_{i,j} (0 ≤ i, j < q) such that each block is an (n/q) × (n/q) submatrix. The matrix multiplication algorithm can then be rewritten as follows.
1. procedure BLOCK_MAT_MULT (A, B, C)
2. begin
3.    for i := 0 to q - 1 do
4.       for j := 0 to q - 1 do
5.       begin
6.          Initialize all elements of C_{i,j} to zero;
7.          for k := 0 to q - 1 do
8.             C_{i,j} := C_{i,j} + A_{i,k} × B_{k,j};
9.       endfor;
10. end BLOCK_MAT_MULT
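As a runnable C sketch of the same idea (assuming n is divisible by q; all names are illustrative) :

#include <stdio.h>

#define N 8         /* matrix dimension, assumed divisible by Q */
#define Q 2         /* number of blocks per dimension */
#define BS (N / Q)  /* block size n/q */

/* Serial block matrix multiplication following BLOCK_MAT_MULT. */
void block_mat_mult(double A[N][N], double B[N][N], double C[N][N])
{
    for (int i = 0; i < Q; i++)
        for (int j = 0; j < Q; j++) {
            /* Initialize all elements of block C_{i,j} to zero. */
            for (int x = 0; x < BS; x++)
                for (int y = 0; y < BS; y++)
                    C[i * BS + x][j * BS + y] = 0.0;
            /* C_{i,j} := C_{i,j} + A_{i,k} x B_{k,j} for k = 0 .. q-1 */
            for (int k = 0; k < Q; k++)
                for (int x = 0; x < BS; x++)
                    for (int z = 0; z < BS; z++)
                        for (int y = 0; y < BS; y++)
                            C[i * BS + x][j * BS + y] +=
                                A[i * BS + x][k * BS + z] * B[k * BS + z][j * BS + y];
        }
}

int main(void)
{
    double A[N][N], B[N][N], C[N][N];
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++) {
            A[r][c] = r + c;
            B[r][c] = (r == c) ? 1.0 : 0.0;  /* identity : C should equal A */
        }
    block_mat_mult(A, B, C);
    printf("C[3][5] = %.1f (expected %.1f)\n", C[3][5], A[3][5]);
    return 0;
}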
Introduction to CUDA C
Q.8 What are the components required to begin developing code in CUDA ?
Ans. : To begin programming / coding in CUDA, the required components are :
1. A CUDA-enabled graphics processor : An NVIDIA GPU with a sufficient amount of graphics memory. CUDA-enabled NVIDIA product families include GeForce, Quadro, Tesla, NVS and Tegra.
2. An NVIDIA device driver : NVIDIA provides system software that allows your programs to communicate with the CUDA-enabled hardware.
3. A CUDA development toolkit : If you have a CUDA-enabled GPU and an NVIDIA device driver, you are ready to run compiled CUDA C code. Select the toolkit for the operating system of choice to complete the installation.
4. A standard C compiler : If the CUDA Toolkit, as suggested in the previous section, is already installed, it includes a compiler for GPU code. For CPU code compilation, a standard C compiler appropriate to the operating system needs to be installed.
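Once these components are installed, a short device-query program (a minimal sketch using standard CUDA runtime calls) can verify that a CUDA-enabled GPU, the driver and the toolkit are working together :

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        /* No usable device : GPU, driver or toolkit is missing. */
        printf("No CUDA-enabled device found.\n");
        return 1;
    }
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d : %s, %d SMs, %.0f MB global memory\n",
               d, prop.name, prop.multiProcessorCount,
               prop.totalGlobalMem / (1024.0 * 1024.0));
    }
    return 0;
}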
Q.9 Explain how CUDA provides optimized performance and task separation between CPU and GPU.
OR Explain the task execution model of CUDA with a diagram.
OR Explain the CUDA task execution model.
OR How can parallelism be achieved through GPU ?
Ans. : Anatomy of a CUDA C/C++ Program
• Task execution is divided between the CPU and the GPU. In CUDA :
1. Serial code executes in a Host (CPU) thread.
2. Parallel code executes in many concurrent Device (GPU) threads, across multiple parallel processing elements, i.e. Streaming Multiprocessors (SMs).
• Fig. Q.9.1 shows the graphical illustration of the serial and parallel code execution on the Host (CPU) and Device (GPU) respectively.
Fig. Q.9.1 Graphical illustration of the serial and parallel code execution
• At this point it is important to understand that in a parallel execution environment there can be a minimum of one CPU and one or more GPUs existing in the system, marshalled by the CUDA architecture. Refer Fig. Q.9.2.
Fig. Q.9.2 Graphical representation of multiple blocks running on each streaming multiprocessor (SM), with a queue of waiting blocks
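As a minimal sketch of how this scheduling is driven from code (kernel name and sizes below are illustrative), a kernel launch specifies the grid of blocks that gets queued and distributed over the available SMs :

#include <stdio.h>

/* Illustrative kernel : each thread scales one array element. */
__global__ void scale(float *x, float s, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        x[idx] *= s;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc((void **)&d_x, n * sizeof(float));

    /* 4096 blocks are queued; the hardware runs multiple blocks
       concurrently on each SM, as depicted in Fig. Q.9.2. */
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_x, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}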
• To achieve parallelism using CUDA, the C program should contain the serial as well as the parallel code. CUDA provides the mechanism for the Host and the Device to cooperate : serial code executes at the Host (CPU) while parallel code executes at the Device (GPU). Serial and parallel computing is achieved by the following steps :
1. The CUDA C program, containing both serial and parallel code, starts executing at the Host (CPU).
2. CPU thread is initialized to execute the serial code and GPU thread is initialized for the parallel code.
3. GPU thread copies data for parallel processing from main memory to GPU memory.
4. CPU initiates the GPU compute kernel and hands over the control to GPU.
5. CPU continues with the serial code.
6. GPU's CUDA cores execute the kernel in parallel.
7. GPU thread copies the resulting data from GPU memory to main memory.
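A minimal CUDA C sketch of these seven steps (vector addition; all names are illustrative) :

#include <stdio.h>

/* Step 6 : GPU's CUDA cores execute this kernel in parallel. */
__global__ void vecAdd(float *a, float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1024;
    float a[1024], b[1024], c[1024];

    /* Steps 1-2 : serial code runs in the Host (CPU) thread. */
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMalloc((void **)&d_b, n * sizeof(float));
    cudaMalloc((void **)&d_c, n * sizeof(float));

    /* Step 3 : copy input data from main memory to GPU memory. */
    cudaMemcpy(d_a, a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, n * sizeof(float), cudaMemcpyHostToDevice);

    /* Step 4 : CPU initiates the compute kernel and hands control to the GPU. */
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    /* Step 5 : the CPU is free to continue with serial code here. */

    /* Step 7 : copy the resulting data from GPU memory back to main memory. */
    cudaMemcpy(c, d_c, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("c[10] = %.1f\n", c[10]);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}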
Write and Launch CUDA C Kernels
Q.11 Write and explain a simple CUDA C kernel.
OR Write a short note on CUDA C kernel functions.
OR Describe how nvcc understands CUDA C functions.
OR Write a short note on CUDA kernels. Also explain kernel call syntax.
Ans. : Let's have a look at our first CUDA C code below.
#include <stdio.h>
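The listing is truncated in the source; a minimal sketch of such a first program (an empty kernel plus the kernel call syntax, in the style this section introduces) is :

#include <stdio.h>

/* __global__ tells nvcc this function runs on the device (GPU)
   and is callable from host code. */
__global__ void kernel(void)
{
}

int main(void)
{
    /* Kernel call syntax : <<<number of blocks, threads per block>>> */
    kernel<<<1, 1>>>();

    /* Ordinary host code, handled by the standard C compiler. */
    printf("Hello, World!\n");
    return 0;
}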