Programming weather, climate, and earth-system models on heterogeneous multi-core platforms
National Center for Atmospheric Research, Boulder, Colorado, September 12-13, 2012
KernelGen – A prototype auto-parallelizing Fortran/C compiler for NVIDIA GPUs
Dmitry Mikushin 1,3, Nikolay Likhogrud 2,3, Hou Yunqing 4, Sergey Kovylov 5
1 Institute of Computational Science, University of Lugano
2 Lomonosov Moscow State University
3 Applied Parallel Computing LLC
4 Nanyang Technological University
5 NVIDIA
KernelGen research project
Goals:
Preserve the original application source code; keep all GPU-specific machinery in the background
Minimize manual work on model-specific code ⇒ develop a compiler toolchain usable with many models
Rationale:
Good old programming languages can still be usable if accurate code analysis and parallelization methods exist
OpenACC is too restrictive for complex applications and needs more flexibility
The GPU tends to become the central processing unit in the near future, which contradicts the OpenACC paradigm
Numerical weather prediction (NWP) is a perfect testbed for novel accelerator programming models
WRF specifics
Sets of multiple numerical blocks to switch between, depending on the model purpose ⇒ no need to compile all code for the GPU ahead of time; JIT-compile only the parts actually used
Complex build system: most of the code is compiled into static libraries, and many potential GPU kernels have external dependencies ⇒ a modified linker is needed to resolve kernel dependencies at link time
Project Team
University of Lugano, Institute of Computational Science
Lomonosov Moscow State University, Faculty of Computational Mathematics and Cybernetics
Applied Parallel Computing LLC
With the technical support of many communities: AsFermi, OpenMPI, and others
Project state in September 2011 (v0.1)
Results:
Could successfully generate CUDA and OpenCL kernels from parallel loops in Fortran, with many limitations
Automatic handling of host-device data transfers, with all process data kept on the host
Better language support than F2C-ACC, but still many issues
Implementation:
Pretty-printed AST – to mark up and transform code into host and device parts
No reliable data dependency analysis in loops
LLVM + C backend – to convert Fortran to C and chain to the CUDA compiler
Project state in September 2012 (v0.2 nvptx)
Results:
Can analyze arbitrary loops in C/C++/Fortran for parallelism and generate CUDA kernels
Better parallelism detection than PGI's OpenACC
Automatic handling of host-device data transfers, with all process data kept on the device
Full compatibility with the conventional GCC compiler and linker
Implementation:
DragonEgg – to emit LLVM IR from C/C++/Fortran
LLVM loop extractor pass – to detect loops at compile time
Modified LLVM Polly – to perform loop analysis at runtime
LLVM NVPTX backend – to emit PTX ISA directly from LLVM IR
Modified GCC compiler and custom LTO wrapper – to support calling external functions in loops and linking code from static libraries
KernelGen user interface design
KernelGen is based on GCC and is fully compatible with it
The executable binary preserves the host-only version, which is used by default; the GPU version is activated on request
The execution mode is controlled by $kernelgen_runmode: 0 – run the original CPU binary, 1 – run the GPU version
$ NETCDF=/opt/kernelgen ./configure
Please select from among the following supported platforms.
...
 27. Linux x86_64, kernelgen-gfortran compiler for CUDA (serial)
 28. Linux x86_64, kernelgen-gfortran compiler for CUDA (smpar)
 29. Linux x86_64, kernelgen-gfortran compiler for CUDA (dmpar)
 30. Linux x86_64, kernelgen-gfortran compiler for CUDA (dm+sm)
Enter selection [1-38]: 27
...
$ ./compile em_real
...
$ cd test/em_real/
$ kernelgen_runmode=1 ./real.exe
OpenACC: no external calls
OpenACC compilers do not allow calls to functions defined in other compilation units:
sincos.f90
!$acc parallel
do k = 1, nz
  do j = 1, ny
    do i = 1, nx
      xy(i, j, k) = sincos_ijk(x(i, j, k), y(i, j, k))
    enddo
  enddo
enddo
!$acc end parallel
function.f90
sincos_ijk = sin(x) + cos(y)
pgfortran -fast -Mnomain -Minfo=accel -ta=nvidia,time -Mcuda=keepgpu,keepbin,keepptx,ptxinfo -c ../sincos.f90 -o sincos.o
PGF90-W-0155-Accelerator region ignored; see -Minfo messages (../sincos.f90: 33)
sincos:
     33, Accelerator region ignored
     36, Accelerator restriction: function/procedure calls are not supported
     37, Accelerator restriction: unsupported call to sincos_ijk
  0 inform, 1 warnings, 0 severes, 0 fatal for sincos
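For completeness, a sketch of what the full function.f90 might contain; only the assignment line is from the slide, the surrounding declarations are assumed:
function sincos_ijk(x, y)
  implicit none
  ! Sketch: only the assignment below comes from the slide
  real, intent(in) :: x, y
  real :: sincos_ijk
  sincos_ijk = sin(x) + cos(y)
end function sincos_ijk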
KernelGen: external calls
Dependency resolution during linking and kernel generation at runtime ⇒ support for external calls defined in other objects or static libraries
!$acc parallel
do k = 1, nz
  do j = 1, ny
    do i = 1, nx
      xy(i, j, k) = sincos_ijk(x(i, j, k), y(i, j, k))
    enddo
  enddo
enddo
!$acc end parallel

sincos_ijk = sin(x) + cos(y)
result
Launching kernel __kernelgen_sincos__loop_3
blockDim = { 32, 16, 1 }
gridDim = { 16, 32, 63 }
Finishing kernel __kernelgen_sincos__loop_3
__kernelgen_sincos__loop_3 time = 0.00536099 sec
OpenACC: no pointer tracking
In Fortran, allocatable arrays carry their dimensions; this is not the case in C:
sincos.c
void sincos(int nx, int ny, int nz, float *x, float *y, float *xy)
{
  #pragma acc parallel
  for (int k = 0; k < nz; k++)
    for (int j = 0; j < ny; j++)
      for (int i = 0; i < nx; i++)
      {
        int idx = i + nx * j + nx * ny * k;
        xy[idx] = sin(x[idx]) + cos(y[idx]);
      }
  ...
}
pgcc -fast -Minfo=accel -ta=nvidia,time -Mcuda=keepgpu,keepbin,keepptx,ptxinfo -c ../sincos.c -o sincos.o
PGC-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Could not find allocated-variable index for symbol (../sincos.c: 27)
sincos:
     27, Accelerator kernel generated
     28, Complex loop carried dependence of *(y) prevents parallelization
         Complex loop carried dependence of *(x) prevents parallelization
         Complex loop carried dependence of *(xy) prevents parallelization
     ...
     30, Accelerator restriction: size of the GPU copy of xy is unknown
     ...
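For contrast, a sketch (not from the slides) of a Fortran counterpart: with explicit-shape dummy arrays the extents are visible to the compiler, so an OpenACC compiler has the information it needs to size the device copies:
! Illustrative sketch only: the array extents are carried by the dummy arguments
subroutine sincos_f(nx, ny, nz, x, y, xy)
  implicit none
  integer :: nx, ny, nz, i, j, k
  real, dimension(nx, ny, nz) :: x, y, xy
  !$acc parallel
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        xy(i, j, k) = sin(x(i, j, k)) + cos(y(i, j, k))
      enddo
    enddo
  enddo
  !$acc end parallel
end subroutine sincos_f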
KernelGen: smart pointer tracking
Pointer alias analysis is performed at runtime, assisted by address substitution.
sincos.c
void sincos(int nx, int ny, int nz, float *x, float *y, float *xy)
{
  #pragma acc parallel
  for (int k = 0; k < nz; k++)
    for (int j = 0; j < ny; j++)
      for (int i = 0; i < nx; i++)
      {
        int idx = i + nx * j + nx * ny * k;
        xy[idx] = sin(x[idx]) + cos(y[idx]);
      }
  ...
}
result
Launching kernel __kernelgen_sincos_loop_8.preheader
blockDim = { 32, 16, 1 }
gridDim = { 16, 32, 63 }
Finishing kernel __kernelgen_sincos_loop_8.preheader
__kernelgen_sincos_loop_8.preheader time = 0.00528601 sec
KernelGen: can parallelize while loops
Thanks to the nature of LLVM and Polly, KernelGen can parallelize while loops that are semantically equivalent to for loops (OpenACC cannot):
i = 1
do while (i .le. nx)
  j = 1
  do while (j .le. nz)
    k = 1
    do while (k .le. ny)
      C(i, j) = C(i, j) + A(i, k) * B(k, j)
      k = k + 1
    enddo
    j = j + 1
  enddo
  i = i + 1
enddo
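For reference, the counted do-loop nest that the while loops above are semantically equivalent to (a plain restatement, not KernelGen output):
do i = 1, nx
  do j = 1, nz
    do k = 1, ny
      C(i, j) = C(i, j) + A(i, k) * B(k, j)
    enddo
  enddo
enddo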
Launching kernel __kernelgen_matmul__loop_9
blockDim = { 32, 32, 1 }
gridDim = { 2, 16, 1 }
Finishing kernel __kernelgen_matmul__loop_9
__kernelgen_matmul__loop_9 time = 0.00953514 sec
Benchmarking: sincos
xy[i,j,k] := sin(x[i,j,k]) + cos(y[i,j,k])
[Bar chart: kernel execution time in ms (less is better) for KernelGen/Fermi and PGI/Fermi; bar values 4.5 and 5.49 ms. PGI 12.6, Fermi – Tesla C2050.]
Benchmarking: matmul
PGI is currently faster because of partial reduction in registers:
[Bar chart: kernel execution time in ms (less is better) for KernelGen/Fermi, KernelGen/Kepler, PGI/Fermi, and PGI/Kepler; bar values 9.41, 6.01, 1.01, and 0.95 ms. PGI 12.6, Fermi – Tesla C2050, Kepler – GTX 680M.]
Benchmarking: jacobi
On finite-difference patterns KernelGen performance is better:
[Bar chart: kernel execution times in ms (less is better) for compute and data-copy kernels on KernelGen/Fermi, KernelGen/Kepler, PGI/Fermi, and PGI/Kepler; bar values 28.35, 23.43, 20.42, 19.09, 11.36, 9.54, 9.44, and 8.5 ms. PGI 12.6, Fermi – Tesla C2050, Kepler – GTX 680M.]
KernelGen concepts
Main GPU and peripheral host system: initially port as much parallel code as possible to the GPU, without human decisions
Fall back to the CPU version in case of calls to host-only functions (I/O, syscalls, ...), non-parallel loops, or inefficient parallel code (see the sketch after this list)
Perform transparent host-device data sharing on demand, keeping all data on the device by default, rather than on the host
Use GCC frontends to support major programming languages (Fortran, C/C++, Ada, etc.)
Unify all languages into a common intermediate representation
Extract potentially parallel loops into kernels at compile time, but decide the execution mode taking into account runtime information (JIT)
Adjust the kernel execution mode using dynamically collected statistics or profile files from previous runs
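A hedged illustration of the CPU-fallback rule above (not from the slides): a loop whose body performs I/O stays on the host code path, since print/write are host-only:
! Illustrative sketch: the print statement is host-only I/O,
! so this loop falls back to the CPU version instead of becoming a GPU kernel
do i = 1, n
  c(i) = a(i) + b(i)
  if (c(i) > threshold) print *, 'threshold exceeded at', i
enddo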
LLVM for Fortran & GPU in a nutshell
LLVM – a universal system for program analysis, transformation, and optimization with a RISC-like intermediate representation in SSA form (LLVM IR)
Frontends (clang, GHC, ...) → LLVM IR → Backends (x86, arm, ptx, ...)
Analysis, optimization, and transformation passes operate on LLVM IR
LLVM for Fortran & GPU in a nutshell
Consider the following kernel written in Fortran:
subroutine sum_kernel(a, b, c, length)
  implicit none
  integer :: length
  real, dimension(length) :: a, b, c
  integer :: idx, threadIdx_x
  idx = threadIdx_x() + 1
  c(idx) = a(idx) + b(idx)
end subroutine sum_kernel
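Each GPU thread computes one array element; for reference, the serial loop this per-thread kernel corresponds to (a sketch, not part of the slides):
do idx = 1, length
  c(idx) = a(idx) + b(idx)
enddo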
LLVM for Fortran & GPU in a nutshell
With the help of GCC and DragonEgg, it can be translated into LLVM IR:
$ kernelgen-dragonegg kernel.f90 -o - | opt -O3 -S -o kernel.ll

target datalayout = "e-p:64:64:64-S128-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f16:16:16-f32:32:32-f64:64:64-f128:128:128-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"

define void @sum_kernel_([0 x float]* noalias nocapture %a, [0 x float]* noalias nocapture %b, [0 x float]* noalias nocapture %c, i32* noalias nocapture %length) nounwind uwtable {
entry:
  %0 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() nounwind
  %1 = add i32 %0, 1
  %2 = sext i32 %1 to i64
  %3 = add i64 %2, -1
  %4 = getelementptr [0 x float]* %a, i64 0, i64 %3
  %5 = load float* %4, align 4
  %6 = getelementptr [0 x float]* %b, i64 0, i64 %3
  %7 = load float* %6, align 4
  %8 = fadd float %5, %7
  %9 = getelementptr [0 x float]* %c, i64 0, i64 %3
  store float %8, float* %9, align 4
  ret void
}

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
LLVM for Fortran & GPU in a nutshell
PTX GPU assembly can be emitted from LLVM IR with the help of the NVPTX backend:
$ llc -march="nvptx64" -mcpu="sm_30" kernel.ll -o kernel.ptx

.func sum_kernel_(.param .b64 sum_kernel__param_0, .param .b64 sum_kernel__param_1, .param .b64 sum_kernel__param_2, .param .b64 sum_kernel__param_3) {
  .reg .pred %p<396>;
  .reg .s16 %rc<396>;
  .reg .s16 %rs<396>;
  .reg .s32 %r<396>;
  .reg .s64 %rl<396>;
  .reg .f32 %f<396>;
  .reg .f64 %fl<396>;
  mov.u32 %r0, %tid.x;
  add.s32 %r0, %r0, 1;
  cvt.s64.s32 %rl0, %r0;
  add.s64 %rl0, %rl0, -1;
  shl.b64 %rl0, %rl0, 2;
  ld.param.u64 %rl1, [sum_kernel__param_0];
  add.s64 %rl1, %rl1, %rl0;
  ld.param.u64 %rl2, [sum_kernel__param_1];
  add.s64 %rl2, %rl2, %rl0;
  ld.f32 %f0, [%rl2];
  ld.f32 %f1, [%rl1];
  add.f32 %f0, %f1, %f0;
  ld.param.u64 %rl1, [sum_kernel__param_2];
  add.s64 %rl0, %rl1, %rl0;
  st.f32 [%rl0], %f0;
  ret;
}
http://kernelgen.org/testit/
Please help us improve the quality and usefulness of KernelGen
The code is open source and can easily be compiled into a binary package
Technical plan for Stage 3 (Fall 2012)
Compiler core improvements (by priority):
1 Get rid of code inlining before applying loop analysis with Polly
2 Fix crashes of kernels using CUDA math functions on Kepler
3 Solve problems with compiling large kernels using ptxas
4 Rewrite the GPU-CPU data sharing model more efficiently
5 Replace host-assisted launching of loop kernels with Kepler K20's dynamic parallelism
6 Enable Polly tiling with support for shared memory, loop interchange, and Kepler's warp shuffle
Improve usability:
Create an Ubuntu PPA repository shipping KernelGen compiler binaries
Testing: NPB, PolyBench, COSMO radiation, WRF
Download link for this presentation:
http://kernelgen.org/ncar2012/
Project mailing list:
[email protected]
Thank you!