Advanced OpenACC Course - Lecture 2: Multi GPU (2016-06-02)
May 19: Advanced Profiling of OpenACC Code
Recordings:
https://developer.nvidia.com/openacc-advanced-course
ADVANCED MULTI-GPU PROGRAMMING
WITH MPI AND OPENACC
Lecture 2: Jiri Kraus, NVIDIA
MPI+OPENACC
[Figure: cluster of nodes, each with a CPU (system memory) and a GPU (GDDR5 memory), connected by a network]
MPI+OPENACC
//MPI rank 0
#pragma acc host_data use_device( sbuf )
MPI_Send(sbuf, size, MPI_DOUBLE, n-1, tag, MPI_COMM_WORLD);
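The matching receive on the destination rank follows the same pattern; a minimal sketch (the buffer name rbuf and the use of MPI_STATUS_IGNORE are illustrative, not taken from the original slide):

//MPI rank n-1
#pragma acc host_data use_device( rbuf )
MPI_Recv(rbuf, size, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);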
Agenda
Using MPI for inter-GPU communication
Debugging and Profiling of MPI+OpenACC apps
Multi Process Service (MPS)
Decreasing parallel overhead
Using MPI for inter-GPU communication
MESSAGE PASSING INTERFACE - MPI
MPI - SKELETON
#include <mpi.h>
int main(int argc, char *argv[]) {
int rank,size;
/* Initialize the MPI library */
MPI_Init(&argc,&argv);
/* Determine the calling process rank and total number of ranks */
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&size);
/* Call MPI routines like MPI_Send, MPI_Recv, ... */
...
/* Shutdown MPI library */
MPI_Finalize();
return 0;
}
MPI
Compiling and Launching
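The commands from this slide are not in the extracted text; a minimal sketch of a typical compile-and-launch workflow with an MPI compiler wrapper around the PGI OpenACC compiler (flags and file names are illustrative):

mpicc -fast -acc -ta=tesla -Minfo=accel jacobi.c -o jacobi   # MPI wrapper invoking the OpenACC-enabled compiler
mpirun -np 2 ./jacobi                                        # launch two MPI ranks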
EXAMPLE: JACOBI SOLVER
EXAMPLE: JACOBI SOLVER
Single GPU
Do Jacobi step:
Anew[iy][ix] = 0.25 * ( A[iy][ix-1] + A[iy][ix+1]
                      + A[iy-1][ix] + A[iy+1][ix] );
Copy Anew to A
Next iteration
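A minimal OpenACC sketch of this single-GPU step (assumes A and Anew are nx-by-ny arrays already placed on the device by an enclosing data region; the loop structure and clauses are illustrative, not the exact course code):

/* Jacobi update of all interior points on the GPU */
#pragma acc parallel loop
for (int iy = 1; iy < ny-1; iy++) {
    #pragma acc loop
    for (int ix = 1; ix < nx-1; ix++) {
        Anew[iy][ix] = 0.25 * ( A[iy][ix-1] + A[iy][ix+1]
                              + A[iy-1][ix] + A[iy+1][ix] );
    }
}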
EXAMPLE: JACOBI SOLVER
Multi GPU
Do Jacobi step:
Anew[iy][ix] = 0.25 * ( A[iy][ix-1] + A[iy][ix+1]
                      + A[iy-1][ix] + A[iy+1][ix] );
Copy Anew to A
Exchange halo with top/bottom neighbors
Next iteration
EXAMPLE JACOBI
Top/Bottom Halo
#pragma acc host_data use_device ( A )
{
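    /* A minimal sketch of the top/bottom halo exchange (the slide's original code is
       not in the extracted text; the neighbor ranks top/bottom, the row indices, and
       MPI_STATUS_IGNORE are illustrative assumptions): */

    /* send first computed row up, receive the bottom halo row from the rank below */
    MPI_Sendrecv( &A[iy_start][1],   nx-2, MPI_DOUBLE, top,    0,
                  &A[iy_end][1],     nx-2, MPI_DOUBLE, bottom, 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE );

    /* send last computed row down, receive the top halo row from the rank above */
    MPI_Sendrecv( &A[iy_end-1][1],   nx-2, MPI_DOUBLE, bottom, 0,
                  &A[iy_start-1][1], nx-2, MPI_DOUBLE, top,    0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE );
}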
HANDLING MULTI GPU NODES
GPU-affinity
#if _OPENACC
acc_device_t device_type = acc_get_device_type();
if ( acc_device_nvidia == device_type ) {
    /* map each MPI rank to one of the GPUs in its node */
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    int devicenum = rank % ngpus;
    acc_set_device_num(devicenum, acc_device_nvidia);
}
#endif /* _OPENACC */
TOOLS FOR MPI+OPENACC APPLICATIONS
MEMORY CHECKING WITH CUDA-MEMCHECK
mpirun -np <np> cuda-memcheck \
  --log-file name.%q{OMPI_COMM_WORLD_RANK}.log \
  --save name.%q{OMPI_COMM_WORLD_RANK}.memcheck \
  ./myapp <args>
MEMORY CHECKING WITH CUDA-MEMCHECK
Read Output Files with cuda-memcheck --read
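A hedged example of reading one rank's saved session back (the file name follows the --save pattern shown above):

cuda-memcheck --read name.0.memcheck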
DEBUGGING MPI+OPENACC APPLICATIONS
Using cuda-gdb with MPI Applications
DEBUGGING MPI+OPENACC APPLICATIONS
cuda-gdb Attach
if ( rank == 0 ) {
    int i = 0;
    /* name holds the host name, e.g. from MPI_Get_processor_name */
    printf("rank %d: pid %d on %s ready for attach\n", rank, getpid(), name);
    fflush(stdout);
    while (0 == i) { sleep(5); }  /* spin until a debugger sets i to 1 */
}
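A hedged sketch of attaching from another terminal on the node printed above (the pid comes from the printf; exact cuda-gdb steps may differ):

cuda-gdb -p <pid>                 # attach to the waiting rank
(cuda-gdb) set var i = 1          # after selecting the frame that contains i, release the wait loop
(cuda-gdb) continue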
DEBUGGING MPI+OPENACC APPLICATIONS
CUDA_DEVICE_WAITS_ON_EXCEPTION
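The slide's details are not in the extracted text; a hedged usage example of this environment variable (with it set, a rank that hits a device exception waits so cuda-gdb can be attached):

export CUDA_DEVICE_WAITS_ON_EXCEPTION=1
mpirun -np 2 ./myapp <args>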
DEBUGGING MPI+OPENACC APPLICATIONS
Helpful if live debugging is not possible, e.g. when too many nodes are needed to reproduce the issue
DEBUGGING MPI+OPENACC APPLICATIONS
CUDA_ENABLE_COREDUMP_ON_EXCEPTION
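A hedged example of enabling GPU core dumps and inspecting one offline (the core file's name and location depend on the driver and the CUDA_COREDUMP_FILE setting):

export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
mpirun -np 2 ./myapp <args>
cuda-gdb ./myapp
(cuda-gdb) target cudacore <core-file>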
DEBUGGING MPI+OPENACC APPLICATIONS
Third Party Tools
PROFILING MPI+OPENACC APPLICATIONS
Using nvprof+NVVP
Embed MPI rank in output filename, process name, and context name
mpirun -np $np nvprof --output-profile profile.%q{OMPI_COMM_WORLD_RANK} \
    --process-name "rank %q{OMPI_COMM_WORLD_RANK}" \
    --context-name "rank %q{OMPI_COMM_WORLD_RANK}" ./myapp <args>
Open MPI: the rank is available in the environment variable OMPI_COMM_WORLD_RANK
PROFILING MPI+OPENACC APPLICATIONS
Using nvprof+NVVP
nvvp jacobi.*.nvprof    (or use NVVP's Import Wizard)
PROFILING MPI+OPENACC APPLICATIONS
Third Party Tools
Vampir
Multi Process Service (MPS)
GPU ACCELERATION OF LEGACY MPI APPS
With Hyper-Q/MPS
Available on Tesla/Quadro with CC 3.5+ (e.g. K20, K40, K80, M40, …)
PROCESSES SHARING GPU WITHOUT MPS
Context Switch Overhead
[Figure: Process A and Process B each hold their own context (Context A, Context B) on the same GPU]
PROCESSES SHARING GPU WITH MPS
Maximum Overlap
[Figure: Process A (Context A) and Process B (Context B) submit work to the GPU through a single MPS process]
PROCESSES SHARING GPU WITH MPS
No Context Switch Overhead
HYPER-Q/MPS CASE STUDY: UMT
Enables overlap between copy and compute of different processes
GPU sharing between MPI ranks increases utilization
HYPER-Q/MPS CASE STUDIES
CPU Scaling Speedup
[Chart: speedup vs. 1 rank/GPU for HACC, MP2C, VASP, ENZO, UMT]
HYPER-Q/MPS CASE STUDIES
Additional Speedup with MPS
[Chart: CPU scaling speedup plus additional overlap/MPS speedup vs. 1 rank/GPU for HACC, MP2C, VASP, ENZO, UMT]
USING MPS
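The slide's commands are not in the extracted text; a hedged sketch of typical MPS usage on a single node (paths, device selection, and whether exclusive compute mode is required vary by system):

export CUDA_VISIBLE_DEVICES=0           # select the GPU the MPS daemon should manage
nvidia-cuda-mps-control -d              # start the MPS control daemon
mpirun -np 4 ./jacobi                   # MPI ranks now share the GPU through MPS
echo quit | nvidia-cuda-mps-control     # shut the daemon down when finished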
MPS SUMMARY
Decreasing parallel overhead
COMMUNICATION + COMPUTATION OVERLAP
OpenMPI 1.10.2 - 4 Tesla K80
[Chart: runtime (s) over problem sizes 8192x8192, 4096x4096, 2048x2048, comparing "No overlap" with "Ideal"]
COMMUNICATION + COMPUTATION OVERLAP
No Overlap
[Figure: per-process domain with process boundary; the MPI halo exchange is a dependency for the next iteration]
COMMUNICATION + COMPUTATION OVERLAP
OpenACC with Async Queues
[Chart: runtime for "No overlap" vs. "Overlap" and the resulting speedup, over local problem sizes 8192x8192, 4096x4096, 2048x2048]
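The slide's code is not in the extracted text; a minimal sketch of the overlap idea with OpenACC async queues (queue numbers, index names, and the placement of the halo exchange are illustrative assumptions): compute the boundary rows first on one queue, exchange their halos while the interior is computed on a second queue.

/* boundary rows first, on async queue 1 */
#pragma acc parallel loop async(1)
for (int ix = 1; ix < nx-1; ix++) {
    Anew[iy_start][ix] = 0.25 * ( A[iy_start][ix-1] + A[iy_start][ix+1]
                                + A[iy_start-1][ix] + A[iy_start+1][ix] );
    Anew[iy_end-1][ix] = 0.25 * ( A[iy_end-1][ix-1] + A[iy_end-1][ix+1]
                                + A[iy_end-2][ix]   + A[iy_end][ix] );
}

/* interior on async queue 2, overlapping with the halo exchange below */
#pragma acc parallel loop async(2)
for (int iy = iy_start+1; iy < iy_end-1; iy++) {
    #pragma acc loop
    for (int ix = 1; ix < nx-1; ix++) {
        Anew[iy][ix] = 0.25 * ( A[iy][ix-1] + A[iy][ix+1]
                              + A[iy-1][ix] + A[iy+1][ix] );
    }
}

#pragma acc wait(1)   /* boundary rows are ready */
/* ... exchange the boundary rows with the top/bottom neighbors via MPI (as on the earlier halo slide) ... */
#pragma acc wait(2)   /* interior finished before copying Anew to A */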
DOMAIN DECOMPOSITION STRATEGIES
Do Jacobi step:
Anew[iy][ix] = 0.25 * ( A[iy][ix-1] + A[iy][ix+1]
                      + A[iy-1][ix] + A[iy+1][ix] );
Copy Anew to A
Exchange halo with 4 neighbors (ring exchange implicitly handles periodic boundary conditions)
Next iteration
DOMAIN DECOMPOSITION WITH TILES
Handling Tile <-> GPU/MPI Rank affinity
2. Map 1D MPI rank to 2D MPI ranks:
int rank = 0;
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
int rankx = rank%sizex;
int ranky = rank/sizex;
[Figure: 3x3 tile grid, e.g. ranks 3,4,5 map to (1,0),(1,1),(1,2) and ranks 6,7,8 map to (2,0),(2,1),(2,2)]
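A hedged follow-on sketch of how the 2D coordinates give the four neighbor ranks for the ring exchange (sizey and the periodic wrap-around are illustrative assumptions, not from the original slide):

int top    = ((ranky-1+sizey)%sizey)*sizex + rankx;
int bottom = ((ranky+1)%sizey)*sizex + rankx;
int left   = ranky*sizex + (rankx-1+sizex)%sizex;
int right  = ranky*sizex + (rankx+1)%sizex;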
ACCESS TO HOMEWORK
Qwiklabs: Getting access
1. Go to https://developer.nvidia.com/qwiklabs-signup
2. Use OPENACC promo code to register and get free access
3. Receive a confirmation email with access instructions
4. Take ‘Advanced Multi GPU Programming with MPI and OpenACC’ lab:
http://bit.ly/oaccnvlab7
Questions?
Email to [email protected]
INSTALL THE OPENACC TOOLKIT (OPTIONAL)
Go to developer.nvidia.com/openacc-toolkit
Register for the OpenACC Toolkit
Install on your personal machine (Linux only)
Free workstation license for academia, 90-day free trial for everyone else
May 19: Advanced Profiling of OpenACC Code
UNIFIED VIRTUAL ADDRESSING
[Figure: "No UVA: Separate Address Spaces" vs. "UVA: Single Address Space"; CPU and GPU connected via PCI-e]
UNIFIED VIRTUAL ADDRESSING
[Figure: node with CPU, system memory, GPU 1, GPU 2, chipset, and IB adapter connected over PCI-e]
NVIDIA GPUDIRECT™
Peer to Peer Transfers
[Figure: data moves directly between GPU 1 memory and GPU 2 memory over PCI-e, without staging in system memory]
NVIDIA GPUDIRECT™
Support for RDMA
[Figure: the IB adapter accesses GPU memory directly over PCI-e, without staging in system memory]
CUDA-AWARE MPI
Example:
MPI Rank 0: MPI_Send from GPU buffer
MPI Rank 1: MPI_Recv to GPU buffer
MPI_Send(s_buf_d,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);
MPI_Recv(r_buf_d,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);
MPI GPU TO REMOTE GPU
Support for RDMA
[Timeline: the whole transfer is covered by a single MPI_Sendrecv]
REGULAR MPI GPU TO REMOTE GPU
MPI Rank 0:
cudaMemcpy(s_buf_h,s_buf_d,size,cudaMemcpyDeviceToHost);
MPI_Send(s_buf_h,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);
MPI Rank 1:
MPI_Recv(r_buf_h,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);
cudaMemcpy(r_buf_d,r_buf_h,size,cudaMemcpyHostToDevice);
REGULAR MPI GPU TO REMOTE GPU
[Timeline: memcpy D->H, then MPI_Sendrecv, then memcpy H->D]
MPI GPU TO REMOTE GPU
without GPUDirect
MPI Rank 0: MPI_Send(s_buf_d,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);
MPI Rank 1: MPI_Recv(r_buf_d,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);
MPI GPU TO REMOTE GPU
without GPUDirect
[Timeline: a single MPI_Sendrecv; staging through host memory happens inside the MPI library]
PERFORMANCE RESULTS TWO NODES
OpenMPI 1.10.2, MLNX FDR IB (4X), Tesla K40@875
[Chart: bandwidth (MB/s)]