
Programming weather, climate, and earth-system models on heterogeneous multi-core platforms

National Center for Atmospheric Research, Boulder, Colorado, September 12-13, 2012

KernelGen – A prototype auto-parallelizing Fortran/C compiler for NVIDIA GPUs

Dmitry Mikushin (1,3), Nikolay Likhogrud (2,3), Hou Yunqing (4), Sergey Kovylov (5)

1 Institute of Computational Science, University of Lugano

2 Lomonosov Moscow State University

3 Applied Parallel Computing LLC

4 Nanyang Technological University

5 NVIDIA

KernelGen research project

Goals:

Conserve the original application source code; keep all GPU-specific things in the background
Minimize manual, code-specific work ⇒ develop a compiler toolchain usable with many models

Rationale:

Good old programming languages can still be usable, given accurate code analysis and parallelization methods
OpenACC is too restrictive for complex apps and needs more flexibility
The GPU is trending toward becoming the central processor in the near future, which contradicts the OpenACC paradigm
NWP is a perfect testbed for novel accelerator programming models

WRF specifics

Multiple interchangeable sets of numerical blocks, selected depending on the model purpose ⇒ no need to compile all the code for the GPU at once; JIT-compile only the parts actually used
Complex build system: most of the code is compiled into static libraries, and many potential GPU kernels have external dependencies ⇒ needs a modified linker to resolve kernel dependencies at link time

Project Team

University of Lugano, Institute of Computational Science
Lomonosov Moscow State University, Faculty of Computational Mathematics and Cybernetics
Applied Parallel Computing LLC

With technical support of many communities:

+ AsFermi, OpenMPI and others

Project state in September 2011 (v0.1)

Results:

Could successfully generate CUDA and OpenCL kernels from parallel loops in Fortran, with many limitations
Automatic handling of host-device data transfers, with all process data kept on the host
Better language support than F2C-ACC, but still a lot of issues

Implementation:

Pretty-printed AST – to mark up and transform code into host and device parts
No reliable data dependency analysis in loops
LLVM + C Backend – to convert Fortran to C and chain it into the CUDA compiler

Project state in September 2012 (v0.2 nvptx)
Results:
Can analyze arbitrary loops in C/C++/Fortran for parallelism and generate CUDA kernels
Better parallelism detection than PGI's OpenACC
Automatic handling of host-device data transfers, with all process data kept on the device
Full compatibility with the conventional GCC compiler and linker

Implementation:
DragonEgg – to emit LLVM IR from C/C++/Fortran
LLVM loop extractor pass – to detect loops at compile time
Modified LLVM Polly – to perform loop analysis at runtime
LLVM NVPTX Backend – to emit PTX ISA directly from LLVM IR
Modified GCC compiler and custom LTO wrapper – to support calling external functions in loops and linking code from static libraries
KernelGen user interface design

KernelGen is based on GCC and is fully compatible with it


The executable binary preserves the host-only version, which is used by default; the GPU version is activated on request
The execution mode is controlled by the kernelgen_runmode environment variable: 0 – run the original CPU binary, 1 – run the GPU version

$ NETCDF=/opt/kernelgen ./configure
Please select from among the following supported platforms.
...
27. Linux x86_64, kernelgen-gfortran compiler for CUDA (serial)
28. Linux x86_64, kernelgen-gfortran compiler for CUDA (smpar)
29. Linux x86_64, kernelgen-gfortran compiler for CUDA (dmpar)
30. Linux x86_64, kernelgen-gfortran compiler for CUDA (dm+sm)
Enter selection [1-38]: 27
...
$ ./compile em_real
...
$ cd test/em_real/
$ kernelgen_runmode=1 ./real.exe
OpenACC: no external calls
OpenACC compilers do not allow calls to functions defined in different compilation units from within accelerator regions:

sincos.f90
!$acc parallel
do k = 1, nz
  do j = 1, ny
    do i = 1, nx
      xy(i, j, k) = sincos_ijk(x(i, j, k), y(i, j, k))
    enddo
  enddo
enddo
!$acc end parallel

function.f90
sincos_ijk = sin(x) + cos(y)
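A complete definition of the externally compiled function is not shown on the slide; it might look roughly like this (a hypothetical sketch based on the assignment above):

! Hypothetical complete function.f90: the function called from the
! accelerated loop, compiled in a separate compilation unit.
real function sincos_ijk(x, y)
  implicit none
  real, intent(in) :: x, y
  sincos_ijk = sin(x) + cos(y)
end function sincos_ijk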

pgfortran -fast -Mnomain -Minfo=accel -ta=nvidia,time -Mcuda=keepgpu,keepbin,keepptx,ptxinfo -c ../sincos.f90 -o sincos.o
PGF90-W-0155-Accelerator region ignored; see -Minfo messages (../sincos.f90: 33)
sincos:
    33, Accelerator region ignored
    36, Accelerator restriction: function/procedure calls are not supported
    37, Accelerator restriction: unsupported call to sincos_ijk
  0 inform, 1 warnings, 0 severes, 0 fatal for sincos

KernelGen: external calls

Dependency resolution during linking + kernel generation at runtime ⇒ support for external calls defined in other objects or static libraries

!$acc parallel
do k = 1, nz
  do j = 1, ny
    do i = 1, nx
      xy(i, j, k) = sincos_ijk(x(i, j, k), y(i, j, k))
    enddo
  enddo
enddo
!$acc end parallel

sincos_ijk = sin(x) + cos(y)

result
Launching kernel __kernelgen_sincos__loop_3
blockDim = { 32, 16, 1 }
gridDim = { 16, 32, 63 }
Finishing kernel __kernelgen_sincos__loop_3
__kernelgen_sincos__loop_3 time = 0.00536099 sec

OpenACC: no pointer tracking
In Fortran, allocatable arrays carry their dimensions; this is not the case in C:

sincos.c
void sincos(int nx, int ny, int nz, float* x, float* y, float* xy)
{
    #pragma acc parallel
    for (int k = 0; k < nz; k++)
        for (int j = 0; j < ny; j++)
            for (int i = 0; i < nx; i++)
            {
                int idx = i + nx * j + nx * ny * k;
                xy[idx] = sin(x[idx]) + cos(y[idx]);
            }
    ...
}

pgcc -fast -Minfo=accel -ta=nvidia,time -Mcuda=keepgpu,keepbin,keepptx,ptxinfo -c ../sincos.c -o sincos.o
PGC-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Could not find allocated-variable index for symbol (../sincos.c: 27)
sincos:
    27, Accelerator kernel generated
    28, Complex loop carried dependence of *(y) prevents parallelization
        Complex loop carried dependence of *(x) prevents parallelization
        Complex loop carried dependence of *(xy) prevents parallelization
    ...
    30, Accelerator restriction: size of the GPU copy of xy is unknown
    ...
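For contrast, a minimal Fortran sketch of the same kernel (hypothetical, not from the slides) illustrates the point made above: explicit-shape array arguments carry their extents, so an OpenACC compiler can determine the size of the device copies.

! Hypothetical Fortran counterpart of sincos.c: the array extents are
! part of the declarations, so the GPU copy sizes are known.
subroutine sincos(nx, ny, nz, x, y, xy)
  implicit none
  integer :: nx, ny, nz
  real, dimension(nx, ny, nz) :: x, y, xy
  integer :: i, j, k
  !$acc parallel
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        xy(i, j, k) = sin(x(i, j, k)) + cos(y(i, j, k))
      enddo
    enddo
  enddo
  !$acc end parallel
end subroutine sincos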

KernelGen: smart pointer tracking
Pointer alias analysis is performed at runtime, assisted by address substitution.

sincos.c
void sincos(int nx, int ny, int nz, float* x, float* y, float* xy)
{
    #pragma acc parallel
    for (int k = 0; k < nz; k++)
        for (int j = 0; j < ny; j++)
            for (int i = 0; i < nx; i++)
            {
                int idx = i + nx * j + nx * ny * k;
                xy[idx] = sin(x[idx]) + cos(y[idx]);
            }
    ...
}

result
Launching kernel __kernelgen_sincos_loop_8.preheader
blockDim = { 32, 16, 1 }
gridDim = { 16, 32, 63 }
Finishing kernel __kernelgen_sincos_loop_8.preheader
__kernelgen_sincos_loop_8.preheader time = 0.00528601 sec

KernelGen: can parallelize while loops

Thanks to the nature of LLVM and Polly, KernelGen can parallelize while-loops that are semantically equivalent to for-loops (OpenACC cannot):
i = 1
do while (i .le. nx)
  j = 1
  do while (j .le. nz)
    k = 1
    do while (k .le. ny)
      C(i, j) = C(i, j) + A(i, k) * B(k, j)
      k = k + 1
    enddo
    j = j + 1
  enddo
  i = i + 1
enddo

Launching kernel __kernelgen_matmul__loop_9
blockDim = { 32, 32, 1 }
gridDim = { 2, 16, 1 }
Finishing kernel __kernelgen_matmul__loop_9
__kernelgen_matmul__loop_9 time = 0.00953514 sec
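
For comparison, the counted do-loop nest this while-loop version is semantically equivalent to would look like the sketch below (same variables as above); the counted form is what OpenACC parallelization expects.

! Equivalent counted do-loop nest (sketch): OpenACC requires this form,
! while KernelGen parallelizes the while-loop version above directly.
do i = 1, nx
  do j = 1, nz
    do k = 1, ny
      C(i, j) = C(i, j) + A(i, k) * B(k, j)
    enddo
  enddo
enddo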

Benchmarking: sincos

xy[i,j,k] := sin(x[i,j,k]) + cos(y[i,j,k])

[Bar chart: kernel execution time, ms (less is better), for Kernelgen/Fermi and PGI/Fermi; the two measured values are 4.5 ms and 5.49 ms. PGI 12.6, Fermi – Tesla C2050.]

Benchmarking: matmul

PGI is currently faster because of partial reduction in registers:

[Bar chart: kernel execution time, ms (less is better), for PGI/Fermi, PGI/Kepler, Kernelgen/Fermi and Kernelgen/Kepler; the PGI kernels run in about 1.01 ms and 0.95 ms, the KernelGen kernels in about 9.41 ms and 6.01 ms. PGI 12.6, Fermi – Tesla C2050, Kepler – GTX 680M.]

Benchmarking: jacobi

On finite-difference patterns, KernelGen performance is better:

[Bar chart: compute-kernel and data-copy-kernel execution times, ms (less is better), for PGI/Fermi, PGI/Kepler, Kernelgen/Fermi and Kernelgen/Kepler; measured values are 28.35, 23.43, 20.42, 19.09, 11.36, 9.54, 9.44 and 8.5 ms, with the KernelGen configurations ahead. PGI 12.6, Fermi – Tesla C2050, Kepler – GTX 680M.]

KernelGen concepts

Main GPU and peripheral host system: initially port to the GPU as much parallel code as possible, without human intervention
Fall back to the CPU version in case of calls to host-only functions (I/O, syscalls, ...), non-parallel loops, or inefficient parallel code
Perform transparent host-device data sharing on demand, keeping all data on the device by default rather than on the host
Use GCC frontends to support major programming languages (Fortran, C/C++, Ada, etc.)
Unify all languages into a common intermediate representation
Extract potentially parallel loops into kernels at compile time, but decide the execution mode taking into account runtime information (JIT)
Adjust the kernel execution mode using dynamically collected statistics or profile files from previous runs

LLVM for Fortran & GPU in a nutshell

LLVM – a universal system for program analysis, transformation and optimization, built around a RISC-like intermediate representation in SSA form (LLVM IR)

Frontends (clang, GHC, ...) → LLVM IR → Backends (x86, arm, ptx, ...), with analysis, optimization and transformation passes operating on the LLVM IR

LLVM for Fortran & GPU in a nutshell

Consider the following kernel written in Fortran:

subroutine sum_kernel(a, b, c, length)

  implicit none

  integer :: length
  real, dimension(length) :: a, b, c
  integer :: idx, threadIdx_x

  idx = threadIdx_x() + 1

  c(idx) = a(idx) + b(idx)

end subroutine sum_kernel
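
In user code one would normally write an ordinary loop rather than a thread-indexed kernel; a hypothetical sketch of the loop corresponding to this kernel (assuming arrays a, b, c of size length) is:

! Hypothetical source loop corresponding to sum_kernel: each iteration
! maps to one GPU thread via the threadIdx_x() index above.
do idx = 1, length
  c(idx) = a(idx) + b(idx)
enddo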

LLVM for Fortran & GPU in a nutshell

With the help of GCC and DragonEgg it can be translated into LLVM IR:

$ kernelgen-dragonegg kernel.f90 -o - | opt -O3 -S -o kernel.ll

target datalayout = "e-p:64:64:64-S128-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f16:16:16-f32:32:32-f64:64:64-f128:128:128-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"

define void @sum_kernel_([0 x float]* noalias nocapture %a, [0 x float]* noalias nocapture %b, [0 x float]* noalias nocapture %c, i32* noalias nocapture %length) nounwind uwtable {
entry:
  %0 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() nounwind
  %1 = add i32 %0, 1
  %2 = sext i32 %1 to i64
  %3 = add i64 %2, -1
  %4 = getelementptr [0 x float]* %a, i64 0, i64 %3
  %5 = load float* %4, align 4
  %6 = getelementptr [0 x float]* %b, i64 0, i64 %3
  %7 = load float* %6, align 4
  %8 = fadd float %5, %7
  %9 = getelementptr [0 x float]* %c, i64 0, i64 %3
  store float %8, float* %9, align 4
  ret void
}

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

LLVM for Fortran & GPU in a nutshell
PTX GPU assembly can be emitted from LLVM IR with the help of the NVPTX backend:
$ llc -march="nvptx64" -mcpu="sm_30" kernel.ll -o kernel.ptx

.func sum_kernel_(.param .b64 sum_kernel__param_0, .param .b64 sum_kernel__param_1, .param .b64 sum_kernel__param_2, .param .b64 sum_kernel__param_3) {
  .reg .pred %p<396>;
  .reg .s16 %rc<396>;
  .reg .s16 %rs<396>;
  .reg .s32 %r<396>;
  .reg .s64 %rl<396>;
  .reg .f32 %f<396>;
  .reg .f64 %fl<396>;
  mov.u32 %r0, %tid.x;
  add.s32 %r0, %r0, 1;
  cvt.s64.s32 %rl0, %r0;
  add.s64 %rl0, %rl0, -1;
  shl.b64 %rl0, %rl0, 2;
  ld.param.u64 %rl1, [sum_kernel__param_0];
  add.s64 %rl1, %rl1, %rl0;
  ld.param.u64 %rl2, [sum_kernel__param_1];
  add.s64 %rl2, %rl2, %rl0;
  ld.f32 %f0, [%rl2];
  ld.f32 %f1, [%rl1];
  add.f32 %f0, %f1, %f0;
  ld.param.u64 %rl1, [sum_kernel__param_2];
  add.s64 %rl0, %rl1, %rl0;
  st.f32 [%rl0], %f0;
  ret;
}

http://kernelgen.org/testit/

Please help us to improve the quality and usefulness of KernelGen


The code is open source and can easily be compiled into a binary package

Technical plan for Stage 3 (Fall 2012)

Compiler core improvements (by priority):

1. Get rid of code inlining before applying loop analysis with Polly
2. Fix crashes of kernels that use CUDA math functions on Kepler
3. Solve problems with compiling big kernels with ptxas
4. Rewrite the GPU-CPU data sharing model to be more efficient
5. Replace host-assisted launching of loop kernels with Kepler K20's dynamic parallelism
6. Enable Polly tiling with support for shared memory, loop interchange and Kepler's warp shuffle

Improve usability:

Create Ubuntu PPA repository shipping KernelGen compiler binaries


Testing: NPB, polybench, COSMO radiation, WRF

Download link for this presentation:
http://kernelgen.org/ncar2012/

Project mailing list:


[email protected]

Thank you!
