Advanced OpenACC Course, Lecture 2: Multi GPU (2016-06-02)

The document covers an advanced OpenACC course session on profiling and optimizing OpenACC code and on advanced multi-GPU programming with MPI and OpenACC. The agenda includes using MPI for inter-GPU communication, debugging and profiling MPI+OpenACC applications, the Multi Process Service (MPS), and decreasing parallel overhead.


ADVANCED OPENACC COURSE

Lecture 2: Advanced Multi-GPU Programming, June 2, 2016


Course Objective: Enable you to scale your applications on multiple GPUs and optimize with profiler tools

2
Course Syllabus

May 19: Advanced Profiling of OpenACC Code
May 26: Office Hours
June 2: Advanced Multi-GPU Programming with MPI and OpenACC

Recordings: https://developer.nvidia.com/openacc-advanced-course
ADVANCED MULTI-GPU PROGRAMMING
WITH MPI AND OPENACC
Lecture 2: Jiri Kraus, NVIDIA
MPI+OPENACC

[Figure: a cluster of nodes 0 .. n-1; in each node a GPU with GDDR5 memory is connected via PCI-e to a CPU with system memory and a network card, and the nodes are connected through the network cards.]
MPI+OPENACC

//MPI rank 0
#pragma acc host_data use_device( sbuf )
MPI_Send(sbuf, size, MPI_DOUBLE, n-1, tag, MPI_COMM_WORLD);

//MPI rank n-1


#pragma acc host_data use_device( rbuf )
MPI_Recv(rbuf, size, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

7
Agenda

Using MPI for inter-GPU communication
Debugging and Profiling of MPI+OpenACC apps
Multi Process Service (MPS)
Decreasing parallel overhead

8
Using MPI for inter-GPU communication

9
MESSAGE PASSING INTERFACE - MPI

Standard to exchange data between processes via messages
Defines an API to exchange messages
    Point to point: e.g. MPI_Send, MPI_Recv
    Collectives: e.g. MPI_Reduce
Multiple implementations (open source and commercial)
    E.g. MPICH, OpenMPI, MVAPICH, IBM Platform MPI, Cray MPT, …
Bindings for C/C++, Fortran, Python, …
10
MPI - SKELETON
#include <mpi.h>
int main(int argc, char *argv[]) {
int rank,size;
/* Initialize the MPI library */
MPI_Init(&argc,&argv);
/* Determine the calling process rank and total number of ranks */
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&size);
/* Call MPI routines like MPI_Send, MPI_Recv, ... */
...
/* Shutdown MPI library */
MPI_Finalize();
return 0;
}
11
MPI
Compiling and Launching

$ mpicc -o myapp myapp.c


$ mpirun -np 4 ./myapp <args>

[Figure: mpirun starts 4 instances of myapp, with ranks 0 to 3.]

12
EXAMPLE: JACOBI SOLVER

Solves the 2D Poisson equation on a rectangle:


Δu(x, y) = e^(−10·(x² + y²))   ∀ (x, y) ∈ Ω \ δΩ

Periodic boundary conditions


Domain decomposition with stripes

13
EXAMPLE: JACOBI SOLVER
Single GPU

While not converged

Do Jacobi step:

for (int iy = 1; iy < NY-1; ++iy)

for (int ix = 1; ix < NX-1; ++ix)

Anew[iy][ix] = - 0.25f*(rhs[iy][ix] - ( A[iy][ix-1] + A[iy][ix+1]

+ A[iy-1][ix] + A[iy+1][ix]) );

Copy Anew to A

Apply periodic boundary conditions

Next iteration
14
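For readers who want to see the pseudocode above as actual directives, here is a minimal OpenACC sketch (not taken from the slides) of one Jacobi iteration; it assumes A, Anew and rhs are NY x NX double arrays already present on the device:

//Minimal sketch: one Jacobi step plus the copy back (assumption: arrays already on the device)
#pragma acc kernels present(A, Anew, rhs)
for (int iy = 1; iy < NY-1; ++iy)
    for (int ix = 1; ix < NX-1; ++ix)
        Anew[iy][ix] = -0.25*(rhs[iy][ix] - ( A[iy][ix-1] + A[iy][ix+1]
                                            + A[iy-1][ix] + A[iy+1][ix] ));

//Copy Anew back to A for the next iteration
#pragma acc kernels present(A, Anew)
for (int iy = 1; iy < NY-1; ++iy)
    for (int ix = 1; ix < NX-1; ++ix)
        A[iy][ix] = Anew[iy][ix];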
EXAMPLE: JACOBI SOLVER
Multi GPU

While not converged

Do Jacobi step:

for (int iy = iy_start; iy < iy_end; ++iy)

for (int ix = 1; ix < NX-1; ++ix)

Anew[iy][ix] = - 0.25f*(rhs[iy][ix] - ( A[iy][ix-1] + A[iy][ix+1]

+ A[iy-1][ix] + A[iy+1][ix]) );

Copy Anew to A

Apply periodic boundary conditions and exchange halo with 2 neighbors

Next iteration
15
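The values iy_start and iy_end used above come from the stripe decomposition. One simple way to derive them (an illustrative sketch, assuming the number of interior rows divides evenly by the number of ranks) is:

//Sketch: stripe decomposition of NY-2 interior rows across `size` ranks
int chunk    = (NY - 2) / size;      //interior rows per rank (assumed to divide evenly)
int iy_start = 1 + rank * chunk;     //first row owned by this rank
int iy_end   = iy_start + chunk;     //one past the last owned row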
EXAMPLE JACOBI
Top/Bottom Halo

#pragma acc host_data use_device ( A )
{
    //1: send the first owned row to the top neighbor, receive the bottom halo row from the bottom neighbor
    MPI_Sendrecv(&A[iy_start][1],   NX-2, MPI_DOUBLE, top,    0,
                 &A[iy_end][1],     NX-2, MPI_DOUBLE, bottom, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    //2: send the last owned row to the bottom neighbor, receive the top halo row from the top neighbor
    MPI_Sendrecv(&A[(iy_end-1)][1],   NX-2, MPI_DOUBLE, bottom, 1,
                 &A[(iy_start-1)][1], NX-2, MPI_DOUBLE, top,    1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
HANDLING MULTI GPU NODES
GPU-affinity

#if _OPENACC
acc_device_t device_type = acc_get_device_type();
if ( acc_device_nvidia == device_type ) {
    int ngpus=acc_get_num_devices(acc_device_nvidia);
    int devicenum=rank%ngpus;
    acc_set_device_num(devicenum,acc_device_nvidia);
}
acc_init(device_type);
#endif /*_OPENACC*/

Alternative (OpenMPI):  int devicenum = atoi(getenv("OMPI_COMM_WORLD_LOCAL_RANK"));
Alternative (MVAPICH2): int devicenum = atoi(getenv("MV2_COMM_WORLD_LOCAL_RANK"));
20
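Besides the launcher-specific environment variables shown above, a local rank can also be derived portably with MPI-3; the following is a sketch of that alternative (not from the slides):

//Sketch: launcher-independent local rank via MPI-3, usable for GPU selection
MPI_Comm local_comm;
int local_rank = 0;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                    MPI_INFO_NULL, &local_comm);   //one communicator per node
MPI_Comm_rank(local_comm, &local_rank);            //rank within this node
MPI_Comm_free(&local_comm);
#if _OPENACC
int ngpus = acc_get_num_devices(acc_device_nvidia);
acc_set_device_num(local_rank % ngpus, acc_device_nvidia);
#endif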
Debugging and Profiling of MPI+OpenACC apps

21
TOOLS FOR MPI+OPENACC APPLICATIONS

Memory checking: cuda-memcheck
Debugging: cuda-gdb
Profiling: nvprof and the NVIDIA Visual Profiler (nvvp)

6/2/20 22
MEMORY CHECKING WITH CUDA-MEMCHECK

cuda-memcheck is a tool similar to Valgrind's memcheck
Can be used in an MPI environment:
    mpiexec -np 2 cuda-memcheck ./myapp <args>
Problem: the output of different processes is interleaved
Solution: use the --save or --log-file command line options and embed the MPI rank
(OpenMPI: OMPI_COMM_WORLD_RANK, MVAPICH2: MV2_COMM_WORLD_RANK) in the file names:

mpirun -np 2 cuda-memcheck \
    --log-file name.%q{OMPI_COMM_WORLD_RANK}.log \
    --save name.%q{OMPI_COMM_WORLD_RANK}.memcheck \
    ./myapp <args>

6/2/20 23
MEMORY CHECKING WITH CUDA-MEMCHECK

6/2/20 24
MEMORY CHECKING WITH CUDA-MEMCHECK
Read Output Files with cuda-memcheck --read

6/2/20 25
DEBUGGING MPI+OPENACC APPLICATIONS
Using cuda-gdb with MPI Applications

Use cuda-gdb just like gdb


For smaller applications, just launch xterms and cuda-gdb

mpiexec -x -np 2 xterm -e cuda-gdb ./myapp <args>

6/2/20 26
DEBUGGING MPI+OPENACC APPLICATIONS
cuda-gdb Attach

if ( rank == 0 ) {
int i=0;
printf("rank %d: pid %d on %s ready for attach\n.", rank, getpid(),name);
while (0 == i) { sleep(5); }
}

> mpiexec -np 2 ./jacobi_mpi+cuda


Jacobi relaxation Calculation: 4096 x 4096 mesh with 2 processes and one Tesla M2070 for
each process (2049 rows per process).
rank 0: pid 30034 on judge107 ready for attach
> ssh judge107
jkraus@judge107:~> cuda-gdb --pid 30034

6/2/20 27
DEBUGGING MPI+OPENACC APPLICATIONS
CUDA_DEVICE_WAITS_ON_EXCEPTION

6/2/20 28
DEBUGGING MPI+OPENACC APPLICATIONS

With CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 core dumps are generated in case of an exception:
    Can be used for offline debugging
    Helpful if live debugging is not possible, e.g. too many nodes needed to reproduce
CUDA_ENABLE_CPU_COREDUMP_ON_EXCEPTION: enable/disable the CPU part of the core dump (enabled by default)
CUDA_COREDUMP_FILE: specify the name of the core dump file
Open GPU core dump:     (cuda-gdb) target cudacore core.cuda
Open CPU+GPU core dump: (cuda-gdb) target core core.cpu core.cuda
6/2/20 29
DEBUGGING MPI+OPENACC APPLICATIONS
CUDA_ENABLE_COREDUMP_ON_EXCEPTION

6/2/20 30
DEBUGGING MPI+OPENACC APPLICATIONS
CUDA_ENABLE_COREDUMP_ON_EXCEPTION

6/2/20 31
DEBUGGING MPI+OPENACC APPLICATIONS
Third Party Tools

Allinea DDT debugger


Rogue Wave TotalView

6/2/20 32
PROFILING MPI+OPENACC APPLICATIONS
Using nvprof+NVVP

Embed the MPI rank in the output filename, process name, and context name (new with CUDA 7.5):

mpirun -np $np nvprof --output-profile profile.%q{OMPI_COMM_WORLD_RANK} \
    --process-name "rank %q{OMPI_COMM_WORLD_RANK}" \
    --context-name "rank %q{OMPI_COMM_WORLD_RANK}"

OpenMPI: OMPI_COMM_WORLD_RANK, MVAPICH2: MV2_COMM_WORLD_RANK

Alternatives:
    Only save the textual output (--log-file)
    Collect data from all processes that run on a node (--profile-all-processes)
6/2/20 33
PROFILING MPI+OPENACC APPLICATIONS
Using nvprof+NVVP

6/2/20 34
PROFILING MPI+OPENACC APPLICATIONS
Using nvprof+NVVP
nvvp jacobi.*.nvprof (or use the Import Wizard)

6/2/20 35
PROFILING MPI+OPENACC APPLICATIONS
Third Party Tools

Multiple parallel profiling tools are OpenACC-aware:
    Score-P
    Vampir
These tools are good for discovering MPI issues as well as basic OpenACC performance inhibitors.

6/2/20 36
Multi Process Service (MPS)

37
GPU ACCELERATION OF LEGACY MPI APPS

Typical legacy application:
    MPI parallel
    Single or few threads per MPI rank (e.g. OpenMP)
    Running with multiple MPI ranks per node
GPU acceleration in phases:
    Proof of concept prototype, …
    Great speedup at kernel level
    Application performance misses expectations
6/2/20 38
MULTI PROCESS SERVICE (MPS)
For Legacy MPI Applications

[Figure: runtime bars for N = 1, 2, 4, 8 MPI ranks, each split into a serial part, a CPU parallel part, and a GPU parallelizable part, shown for a multicore CPU-only run and for a GPU-accelerated run.]

With Hyper-Q/MPS: available on Tesla/Quadro GPUs with compute capability 3.5+ (e.g. K20, K40, K80, M40, …)


PROCESSES SHARING GPU WITHOUT MPS
No Overlap

[Figure: processes A and B each have their own context on the GPU; kernels from process A and process B do not overlap.]
PROCESSES SHARING GPU WITHOUT MPS
Context Switch Overhead

Time-sliced use of the GPU: the GPU context-switches between the processes, which adds overhead.
6/2/20 46
PROCESSES SHARING GPU WITH MPS
Maximum Overlap

[Figure: processes A and B submit work through a single MPS process; kernels from process A and kernels from process B run on the GPU concurrently.]
6/2/20 47
PROCESSES SHARING GPU WITH MPS
No Context Switch Overhead

6/2/20 48
HYPER-Q/MPS CASE STUDY: UMT

Enables overlap between copy and compute of different processes
GPU sharing between MPI ranks increases utilization
6/2/20 49
HYPER-Q/MPS CASE STUDIES
CPU Scaling Speedup / Additional Speedup with MPS

[Chart: speedup vs. 1 rank/GPU (axis 0 to 5) for HACC, MP2C, VASP, ENZO and UMT, showing the CPU scaling speedup and the additional overlap/MPS speedup.]
USING MPS

No application modifications necessary
Not limited to MPI applications
MPS control daemon: spawns the MPS server upon CUDA application startup

#Typical Setup
nvidia-smi -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d

#On Cray XK/XC systems
export CRAY_CUDA_MPS=1
6/2/20 52
MPS SUMMARY

Easy path to get GPU acceleration for legacy applications
Enables overlapping of memory copies and compute between different MPI ranks
Remark: MPS adds some overhead!
6/2/20 53
Decreasing parallel overhead

54
COMMUNICATION + COMPUTATION OVERLAP
OpenMPI 1.10.2 - 4 Tesla K80

[Chart: runtime in seconds for problem sizes 8192x8192, 4096x4096 and 2048x2048, comparing the version without overlap against the ideal runtime.]
6/2/20 55
COMMUNICATION + COMPUTATION OVERLAP
No Overlap vs. Overlap

[Diagram: without overlap, the whole domain is processed and then the MPI halo exchange follows. Boundary and inner domain processing can overlap: process the boundary first, then exchange halos with MPI while the inner domain is processed, for a possible gain. The MPI exchange depends on the boundary processing.]
6/2/20 56
COMMUNICATION + COMPUTATION OVERLAP
OpenACC with Async Queues

#pragma acc kernels present ( A, Anew )
for ( ... )
    //Process boundary

#pragma acc kernels present ( A, Anew ) async
for ( ... )
    //Process inner domain

#pragma acc host_data use_device ( A )
{
    MPI_Sendrecv( &A[iy_start][ix_start], (ix_end-ix_start), MPI_DOUBLE, top   , 0,
                  &A[iy_end][ix_start],   (ix_end-ix_start), MPI_DOUBLE, bottom, 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE );
    MPI_Sendrecv( &A[(iy_end-1)][ix_start],   (ix_end-ix_start), MPI_DOUBLE, bottom, 0,
                  &A[(iy_start-1)][ix_start], (ix_end-ix_start), MPI_DOUBLE, top   , 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE );
}
#pragma acc wait //wait for iteration to finish
6/2/20 57
COMMUNICATION + COMPUTATION OVERLAP
OpenMPI 1.10.2 - 4 Tesla K80

[Chart: runtime in seconds (no overlap vs. overlap) and the resulting speedup (roughly 1.01 to 1.09) for local problem sizes 8192x8192, 4096x4096 and 2048x2048.]
6/2/20 58
DOMAIN DECOMPOSITION STRATEGIES

Using stripes (2D) / planes (3D):
    Minimizes the number of neighbors
    Avoids noncontiguous halo exchange
    Good for latency bound problems
Using tiles (2D) / boxes (3D):
    Minimizes the surface to volume ratio
    Requires noncontiguous halo exchange
    Good for bandwidth bound problems
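To make the surface-to-volume trade-off concrete (an illustrative calculation, not from the slides): for an N x N grid split across P ranks, a stripe decomposition exchanges about 2·N points per rank per iteration, while a √P x √P tile decomposition exchanges about 4·N/√P points. For N = 8192 and P = 16 that is roughly 16384 points per rank for stripes versus 8192 for tiles, at the cost of noncontiguous left/right halo columns that have to be packed and unpacked.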
EXAMPLE: JACOBI SOLVER
Multi GPU with tiled domain decomposition

While not converged

Do Jacobi step:

for (int iy = iy_start; iy < iy_end; ++iy)

for (int ix = ix_start; ix < ix_end; ++ix)

Anew[iy][ix] = - 0.25f*(rhs[iy][ix] - ( A[iy][ix-1] + A[iy][ix+1]

+ A[iy-1][ix] + A[iy+1][ix]) );

Copy Anew to A

Exchange halo with 4 neighbors (ring exchange implicitly handles periodic boundary conditions)

Next iteration
60
DOMAIN DECOMPOSITION WITH TILES
Handling Tile <-> GPU/MPI Rank affinity

[Figure: a 3x3 grid of tiles (0,0) .. (2,2) mapped to MPI ranks 0 .. 8, row by row.]

int rank = 0;
int size = 1;
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&size);

Alternative: MPI_Cart_create and MPI_Cart_shift
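The MPI_Cart_create / MPI_Cart_shift alternative mentioned above could look roughly like the sketch below (the 3x3 grid and periodic boundaries are assumptions for this example):

//Sketch: build a periodic 2D process grid and query the neighbors in each dimension
int dims[2]    = {3, 3};        //sizey x sizex, must match the number of ranks
int periods[2] = {1, 1};        //periodic boundary conditions in both directions
MPI_Comm cart_comm;
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart_comm);

int top, bottom, left, right;
MPI_Cart_shift(cart_comm, 0, 1, &top,  &bottom);   //neighbors along dimension 0
MPI_Cart_shift(cart_comm, 1, 1, &left, &right);    //neighbors along dimension 1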
DOMAIN DECOMPOSITION WITH TILES
Handling Tile <-> GPU/MPI Rank affinity

1. Determine the number of ranks in x direction and y direction:

int size = 1;
MPI_Comm_size(MPI_COMM_WORLD,&size);
int sizex = 3;
int sizey = size/sizex;
assert(sizex*sizey == size);

2. Map the 1D MPI rank to 2D MPI ranks:

int rank = 0;
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
int rankx = rank%sizex;
int ranky = rank/sizex;

[Figure: ranks 0 .. 8 mapped to tile coordinates (ranky, rankx): 0 -> (0,0), 1 -> (0,1), ..., 8 -> (2,2).]

3. Determine the neighbors in y direction (periodic):

int lefty = (ranky == 0) ? (sizey-1) : ranky-1;
int righty = (ranky == (sizey-1)) ? 0 : ranky+1;

4. Map the 2D neighbor coordinates back to 1D MPI ranks:

int left = lefty * sizex + rankx;
int right = righty * sizex + rankx;
EXAMPLE: JACOBI
Left/Right Halo

//right neighbor omitted


#pragma acc kernels present ( A, to_left )
for (int iy = iy_start; iy < iy_end; iy++)
to_left[iy] = A[iy][ix_start];

#pragma acc host_data use_device ( from_right, to_left ) {


MPI_Sendrecv( to_left, NY-2, MPI_DOUBLE, left, 0,
from_right, NY-2, MPI_DOUBLE, right, 0,
MPI_COMM_WORLD, MPI_STATUS_IGNORE );
}

#pragma acc kernels present ( A, from_right )


for (int iy = iy_start; iy < iy_end; iy++)
A[iy][ix_end] = from_right[iy];
73
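For completeness, the exchange with the right neighbor that the slide omits would mirror the code above; a sketch, assuming to_right and from_left are device-resident buffers analogous to to_left and from_right:

#pragma acc kernels present ( A, to_right )
for (int iy = iy_start; iy < iy_end; iy++)
    to_right[iy] = A[iy][ix_end-1];          //pack the last owned column

#pragma acc host_data use_device ( from_left, to_right )
{
    MPI_Sendrecv( to_right,  NY-2, MPI_DOUBLE, right, 1,
                  from_left, NY-2, MPI_DOUBLE, left,  1,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE );
}

#pragma acc kernels present ( A, from_left )
for (int iy = iy_start; iy < iy_end; iy++)
    A[iy][ix_start-1] = from_left[iy];       //unpack into the left halo column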
Homework

74
ACCESS TO HOMEWORK
Qwiklabs: Getting access
1. Go to https://developer.nvidia.com/qwiklabs-signup
2. Use OPENACC promo code to register and get free access
3. Receive a confirmation email with access instructions
4. Take ‘Advanced Multi GPU Programming with MPI and OpenACC’ lab:
http://bit.ly/oaccnvlab7

Questions?
Email to [email protected]
75
INSTALL THE OPENACC TOOLKIT (OPTIONAL)

Go to developer.nvidia.com/openacc-toolkit
Register for the OpenACC Toolkit
Install on your personal machine (Linux only)
Free workstation license for academia / 90 day free trial for the rest
76
Course Syllabus

May 19: Advanced Profiling of OpenACC Code
May 26: Office Hours
June 2: Advanced Multi-GPU Programming with MPI and OpenACC

Recordings: https://developer.nvidia.com/openacc-advanced-course
Questions? Email [email protected]
CUDA-aware MPI implementation details

78
UNIFIED VIRTUAL ADDRESSING

[Figure: without UVA, system memory and GPU memory are separate address spaces, each starting at 0x0000; with UVA, CPU and GPU memory share a single virtual address space.]
6/2/20 79
UNIFIED VIRTUAL ADDRESSING

One address space for all CPU and GPU memory
    Determine the physical memory location from a pointer value
    Enables libraries to simplify their interfaces (e.g. MPI and cudaMemcpy)
Supported on devices with compute capability 2.0+ for 64-bit applications on Linux and Windows (+TCC)
6/2/20 80
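As an illustration of "determine the physical memory location from a pointer value", a CUDA-aware library can query a pointer with cudaPointerGetAttributes. A sketch, not from the slides; it assumes a CUDA 10 or newer toolkit where the attribute field is named type (older toolkits used memoryType):

#include <cuda_runtime.h>
#include <stdio.h>

//Sketch: query whether a UVA pointer refers to host, device or managed memory
static void where_is(const void *ptr)
{
    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, ptr) == cudaSuccess) {
        //attr.type is e.g. cudaMemoryTypeHost, cudaMemoryTypeDevice or cudaMemoryTypeManaged
        printf("pointer %p: memory type %d, device %d\n", ptr, (int)attr.type, attr.device);
    } else {
        printf("pointer %p: not known to the CUDA runtime\n", ptr);
    }
}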
NVIDIA GPUDIRECT™
Accelerated Communication with Network & Storage Devices

[Figure: node with GPU1 and GPU2 (each with their own memory), CPU, system memory, chipset and InfiniBand adapter; GPUDirect accelerates the data path between GPU memory and the network/storage device.]
6/2/20 81
NVIDIA GPUDIRECT™
Peer to Peer Transfers

[Figure: same node layout; data moves directly between GPU1 memory and GPU2 memory over PCI-e, without staging in system memory.]
6/2/20 82
NVIDIA GPUDIRECT™
Support for RDMA

[Figure: same node layout; the InfiniBand adapter reads and writes GPU memory directly over PCI-e, without staging in system memory.]
6/2/20 83
CUDA-AWARE MPI

Example:
    MPI rank 0: MPI_Send from a GPU buffer
    MPI rank 1: MPI_Recv to a GPU buffer
Shows how CUDA+MPI works in principle
    Depending on the MPI implementation, message size, system setup, … the situation might be different
Two GPUs in two nodes
6/2/20 84
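Putting the fragments together, a complete two-rank example might look like the sketch below (an illustration, not from the slides; it assumes a CUDA-aware MPI so that device pointers can be passed to MPI_Send/MPI_Recv via host_data):

#include <mpi.h>
#define N 1024

int main(int argc, char *argv[]) {
    int rank = 0;
    double buf[N];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < N; ++i) buf[i] = rank;   //some data to send

    #pragma acc data copy(buf)
    {
        #pragma acc host_data use_device(buf)
        {
            if (rank == 0)       //send a device-resident buffer directly
                MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)  //receive directly into device memory
                MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }
    MPI_Finalize();
    return 0;
}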
MPI GPU TO REMOTE GPU
Support for RDMA

[Diagram: MPI rank 0 and MPI rank 1, each with a GPU and a host; the message moves directly from GPU memory to remote GPU memory.]

//MPI rank 0
MPI_Send(s_buf_d,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);

//MPI rank 1
MPI_Recv(r_buf_d,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);
86
MPI GPU TO REMOTE GPU
Support for RDMA

[Timeline: with GPUDirect RDMA the transfer is covered by a single MPI_Sendrecv, with no separate copy phases.]
6/2/20 87
REGULAR MPI GPU TO REMOTE GPU

[Diagram: MPI rank 0 and MPI rank 1, each with a GPU and a host; the message is staged through host buffers.]

//MPI rank 0
cudaMemcpy(s_buf_h,s_buf_d,size,cudaMemcpyDeviceToHost);
MPI_Send(s_buf_h,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);

//MPI rank 1
MPI_Recv(r_buf_h,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);
cudaMemcpy(r_buf_d,r_buf_h,size,cudaMemcpyHostToDevice);
REGULAR MPI GPU TO REMOTE GPU

[Timeline: memcpy D->H, then MPI_Sendrecv, then memcpy H->D; the explicit copies add to the transfer time.]
6/2/20 89
MPI GPU TO REMOTE GPU
without GPUDirect

[Diagram: MPI rank 0 and MPI rank 1, each with a GPU and a host; the CUDA-aware MPI library stages the transfer through host memory internally.]

//MPI rank 0
MPI_Send(s_buf_d,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);

//MPI rank 1
MPI_Recv(r_buf_d,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);
90
MPI GPU TO REMOTE GPU
without GPUDirect

[Timeline: a single MPI_Sendrecv call; the device-to-host staging is handled inside the MPI library.]
6/2/20 91
PERFORMANCE RESULTS TWO NODES
OpenMPI 1.10.2, MLNX FDR IB (4X), Tesla K40@875

[Chart: bandwidth in MB/s (0 to 7000) over message size for regular MPI, CUDA-aware MPI, and CUDA-aware MPI with GPUDirect RDMA.]

Latency (1 Byte): regular MPI 24.99 us, CUDA-aware MPI 21.72 us, CUDA-aware MPI with GPUDirect RDMA 5.65 us
