Programming weather, climate, and earth-system models on heterogeneous multi-core platforms
National Center for Atmospheric Research, Boulder, Colorado, September 12-13, 2012
KernelGen – A prototype auto-parallelizing Fortran/C compiler for NVIDIA GPUs
Dmitry Mikushin 1,3, Nikolay Likhogrud 2,3, Hou Yunqing 4, Sergey Kovylov 5
1 Institute of Computational Science, University of Lugano
2 Lomonosov Moscow State University
3 Applied Parallel Computing LLC
4 Nanyang Technological University
5 NVIDIA
KernelGen research project
Goals:
Preserve the original application source code; keep all GPU-specific machinery in the background
Minimize manual work on model-specific code ⇒ develop a compiler toolchain usable with many models
Rationale:
Good old programming languages can still be usable if accurate code analysis and parallelization methods exist
OpenACC is too restrictive for complex applications and needs more flexibility
The GPU tends to become the central processing unit in the near future, which contradicts the OpenACC paradigm
Numerical weather prediction (NWP) is a perfect testbed for novel accelerator programming models
WRF specifics
Sets of multiple numerical blocks to switch between, depending on the model purpose ⇒ no need to compile all code for the GPU ahead of time; JIT-compile only the parts actually used
Complex build system: most of the code is compiled into static libraries, and many potential GPU kernels have external dependencies ⇒ a modified linker is needed to resolve kernel dependencies at link time
Project Team
University of Lugano, Institute of Computational Science
Lomonosov Moscow State University, Faculty of Computational Mathematics and Cybernetics
Applied Parallel Computing LLC
With the technical support of many communities: AsFermi, OpenMPI, and others
Project state in September 2011 (v0.1)
Results:
Could successfully generate CUDA and OpenCL kernels from parallel loops in Fortran, with many limitations
Automatic handling of host-device data transfers, with all process data kept on the host
Better language support than F2C-ACC, but still many issues
Implementation:
Pretty-printed AST – to mark up and transform code into host and device parts
No reliable data dependency analysis in loops
LLVM + C backend – to convert Fortran to C and chain to the CUDA compiler
Project state in September 2012 (v0.2 nvptx)
Results:
Can analyze arbitrary loops in C/C++/Fortran for parallelism and generate CUDA kernels
Better parallelism detection than PGI's OpenACC
Automatic handling of host-device data transfers, with all process data kept on the device
Full compatibility with the conventional GCC compiler and linker
Implementation:
DragonEgg – to emit LLVM IR from C/C++/Fortran
LLVM loop extractor pass – to detect loops at compile time
Modified LLVM Polly – to perform loop analysis at runtime
LLVM NVPTX backend – to emit PTX ISA directly from LLVM IR
Modified GCC compiler and custom LTO wrapper – to support calling external functions in loops and linking code from static libraries
KernelGen user interface design
KernelGen is based on GCC and is fully compatible with it
The executable binary preserves the host-only version, which is used by default; the GPU version is activated on request
The execution mode is controlled by $kernelgen_runmode: 0 – run the original CPU binary, 1 – run the GPU version
$ NETCDF=/opt/kernelgen ./configure
Please select from among the following supported platforms.
...
 27. Linux x86_64, kernelgen-gfortran compiler for CUDA (serial)
 28. Linux x86_64, kernelgen-gfortran compiler for CUDA (smpar)
 29. Linux x86_64, kernelgen-gfortran compiler for CUDA (dmpar)
 30. Linux x86_64, kernelgen-gfortran compiler for CUDA (dm+sm)
Enter selection [1-38]: 27
...
$ ./compile em_real
...
$ cd test/em_real/
$ kernelgen_runmode=1 ./real.exe
OpenACC: no external calls
OpenACC compilers do not allow calls to functions defined in other compilation units:
sincos.f90
!$acc parallel
do k = 1, nz
  do j = 1, ny
    do i = 1, nx
      xy(i, j, k) = sincos_ijk(x(i, j, k), y(i, j, k))
    enddo
  enddo
enddo
!$acc end parallel
function.f90
sincos_ijk = sin(x) + cos(y)
pgfortran -fast -Mnomain -Minfo=accel -ta=nvidia,time -Mcuda=keepgpu,keepbin,keepptx,ptxinfo -c ../sincos.f90 -o sincos.o
PGF90-W-0155-Accelerator region ignored; see -Minfo messages (../sincos.f90: 33)
sincos:
     33, Accelerator region ignored
     36, Accelerator restriction: function/procedure calls are not supported
     37, Accelerator restriction: unsupported call to sincos_ijk
  0 inform, 1 warnings, 0 severes, 0 fatal for sincos
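For completeness, a sketch of what the full function.f90 might contain; only the assignment line is from the slide, the surrounding declarations are assumed:
function sincos_ijk(x, y)
  implicit none
  ! Sketch: only the assignment below comes from the slide
  real, intent(in) :: x, y
  real :: sincos_ijk
  sincos_ijk = sin(x) + cos(y)
end function sincos_ijk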
KernelGen: external calls
Dependency resolution during linking and kernel generation at runtime ⇒ support for external calls defined in other objects or static libraries
!$acc parallel
do k = 1, nz
  do j = 1, ny
    do i = 1, nx
      xy(i, j, k) = sincos_ijk(x(i, j, k), y(i, j, k))
    enddo
  enddo
enddo
!$acc end parallel

sincos_ijk = sin(x) + cos(y)
result
Launching kernel __kernelgen_sincos__loop_3
blockDim = { 32, 16, 1 }
gridDim = { 16, 32, 63 }
Finishing kernel __kernelgen_sincos__loop_3
__kernelgen_sincos__loop_3 time = 0.00536099 sec
OpenACC: no pointer tracking
In Fortran, allocatable arrays carry their dimensions; this is not the case in C:
sincos.c
void sincos(int nx, int ny, int nz, float *x, float *y, float *xy)
{
  #pragma acc parallel
  for (int k = 0; k < nz; k++)
    for (int j = 0; j < ny; j++)
      for (int i = 0; i < nx; i++)
      {
        int idx = i + nx * j + nx * ny * k;
        xy[idx] = sin(x[idx]) + cos(y[idx]);
      }
  ...
}
pgcc -fast -Minfo=accel -ta=nvidia,time -Mcuda=keepgpu,keepbin,keepptx,ptxinfo -c ../sincos.c -o sincos.o
PGC-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Could not find allocated-variable index for symbol (../sincos.c: 27)
sincos:
     27, Accelerator kernel generated
     28, Complex loop carried dependence of *(y) prevents parallelization
         Complex loop carried dependence of *(x) prevents parallelization
         Complex loop carried dependence of *(xy) prevents parallelization
     ...
     30, Accelerator restriction: size of the GPU copy of xy is unknown
     ...
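For contrast, a sketch (not from the slides) of a Fortran counterpart: with explicit-shape dummy arrays the extents are visible to the compiler, so an OpenACC compiler has the information it needs to size the device copies:
! Illustrative sketch only: the array extents are carried by the dummy arguments
subroutine sincos_f(nx, ny, nz, x, y, xy)
  implicit none
  integer :: nx, ny, nz, i, j, k
  real, dimension(nx, ny, nz) :: x, y, xy
  !$acc parallel
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        xy(i, j, k) = sin(x(i, j, k)) + cos(y(i, j, k))
      enddo
    enddo
  enddo
  !$acc end parallel
end subroutine sincos_f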
KernelGen: smart pointer tracking
Pointer alias analysis is performed at runtime, assisted by address substitution.
sincos.c
void sincos(int nx, int ny, int nz, float *x, float *y, float *xy)
{
  #pragma acc parallel
  for (int k = 0; k < nz; k++)
    for (int j = 0; j < ny; j++)
      for (int i = 0; i < nx; i++)
      {
        int idx = i + nx * j + nx * ny * k;
        xy[idx] = sin(x[idx]) + cos(y[idx]);
      }
  ...
}
result
Launching kernel __kernelgen_sincos_loop_8.preheader
blockDim = { 32, 16, 1 }
gridDim = { 16, 32, 63 }
Finishing kernel __kernelgen_sincos_loop_8.preheader
__kernelgen_sincos_loop_8.preheader time = 0.00528601 sec
KernelGen: can parallelize while loops
Thanks to the nature of LLVM and Polly, KernelGen can parallelize while loops that are semantically equivalent to for loops (OpenACC cannot):
i = 1
do while (i .le. nx)
  j = 1
  do while (j .le. nz)
    k = 1
    do while (k .le. ny)
      C(i, j) = C(i, j) + A(i, k) * B(k, j)
      k = k + 1
    enddo
    j = j + 1
  enddo
  i = i + 1
enddo
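For reference, the counted do-loop nest that the while loops above are semantically equivalent to (a plain restatement, not KernelGen output):
do i = 1, nx
  do j = 1, nz
    do k = 1, ny
      C(i, j) = C(i, j) + A(i, k) * B(k, j)
    enddo
  enddo
enddo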
Launching kernel __kernelgen_matmul__loop_9
blockDim = { 32, 32, 1 }
gridDim = { 2, 16, 1 }
Finishing kernel __kernelgen_matmul__loop_9
__kernelgen_matmul__loop_9 time = 0.00953514 sec
Benchmarking: sincos
xy[i,j,k] := sin(x[i,j,k]) + cos(y[i,j,k])
[Bar chart: kernel execution time in ms (less is better) for KernelGen/Fermi and PGI/Fermi; bar values 4.5 and 5.49 ms. PGI 12.6, Fermi – Tesla C2050.]
Benchmarking: matmul
PGI is currently faster because of partial reduction in registers:
[Bar chart: kernel execution time in ms (less is better) for KernelGen/Fermi, KernelGen/Kepler, PGI/Fermi, and PGI/Kepler; bar values 9.41, 6.01, 1.01, and 0.95 ms. PGI 12.6, Fermi – Tesla C2050, Kepler – GTX 680M.]
Benchmarking: jacobi
On finite-difference patterns KernelGen performance is better:
[Bar chart: kernel execution times in ms (less is better) for compute and data-copy kernels on KernelGen/Fermi, KernelGen/Kepler, PGI/Fermi, and PGI/Kepler; bar values 28.35, 23.43, 20.42, 19.09, 11.36, 9.54, 9.44, and 8.5 ms. PGI 12.6, Fermi – Tesla C2050, Kepler – GTX 680M.]
KernelGen concepts
Main GPU and peripheral host system: initially port as much parallel code as possible to the GPU, without human decisions
Fall back to the CPU version in case of calls to host-only functions (I/O, syscalls, ...), non-parallel loops, or inefficient parallel code (see the sketch after this list)
Perform transparent host-device data sharing on demand, keeping all data on the device by default, rather than on the host
Use GCC frontends to support major programming languages (Fortran, C/C++, Ada, etc.)
Unify all languages into a common intermediate representation
Extract potentially parallel loops into kernels at compile time, but decide the execution mode taking into account runtime information (JIT)
Adjust the kernel execution mode using dynamically collected statistics or profile files from previous runs
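A hedged illustration of the CPU-fallback rule above (not from the slides): a loop whose body performs I/O stays on the host code path, since print/write are host-only:
! Illustrative sketch: the print statement is host-only I/O,
! so this loop falls back to the CPU version instead of becoming a GPU kernel
do i = 1, n
  c(i) = a(i) + b(i)
  if (c(i) > threshold) print *, 'threshold exceeded at', i
enddo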
LLVM for Fortran & GPU in a nutshell
LLVM – a universal system for program analysis, transformation, and optimization with a RISC-like intermediate representation in SSA form (LLVM IR)
Frontends (clang, GHC, ...) → LLVM IR → Backends (x86, arm, ptx, ...)
Analysis, optimization, and transformation passes operate on LLVM IR
LLVM for Fortran & GPU in a nutshell
Consider the following kernel written in Fortran:
subroutine sum_kernel(a, b, c, length)
  implicit none
  integer :: length
  real, dimension(length) :: a, b, c
  integer :: idx, threadIdx_x
  idx = threadIdx_x() + 1
  c(idx) = a(idx) + b(idx)
end subroutine sum_kernel
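Each GPU thread computes one array element; for reference, the serial loop this per-thread kernel corresponds to (a sketch, not part of the slides):
do idx = 1, length
  c(idx) = a(idx) + b(idx)
enddo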
LLVM for Fortran & GPU in a nutshell
With the help of GCC and DragonEgg, it can be translated into LLVM IR:
$ kernelgen-dragonegg kernel.f90 -o - | opt -O3 -S -o kernel.ll

target datalayout = "e-p:64:64:64-S128-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f16:16:16-f32:32:32-f64:64:64-f128:128:128-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"

define void @sum_kernel_([0 x float]* noalias nocapture %a, [0 x float]* noalias nocapture %b, [0 x float]* noalias nocapture %c, i32* noalias nocapture %length) nounwind uwtable {
entry:
  %0 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() nounwind
  %1 = add i32 %0, 1
  %2 = sext i32 %1 to i64
  %3 = add i64 %2, -1
  %4 = getelementptr [0 x float]* %a, i64 0, i64 %3
  %5 = load float* %4, align 4
  %6 = getelementptr [0 x float]* %b, i64 0, i64 %3
  %7 = load float* %6, align 4
  %8 = fadd float %5, %7
  %9 = getelementptr [0 x float]* %c, i64 0, i64 %3
  store float %8, float* %9, align 4
  ret void
}

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
LLVM for Fortran & GPU in a nutshell
PTX GPU assembly can be emitted from LLVM IR with the help of the NVPTX backend:
$ llc -march="nvptx64" -mcpu="sm_30" kernel.ll -o kernel.ptx

.func sum_kernel_(.param .b64 sum_kernel__param_0, .param .b64 sum_kernel__param_1, .param .b64 sum_kernel__param_2, .param .b64 sum_kernel__param_3) {
  .reg .pred %p<396>;
  .reg .s16 %rc<396>;
  .reg .s16 %rs<396>;
  .reg .s32 %r<396>;
  .reg .s64 %rl<396>;
  .reg .f32 %f<396>;
  .reg .f64 %fl<396>;
  mov.u32 %r0, %tid.x;
  add.s32 %r0, %r0, 1;
  cvt.s64.s32 %rl0, %r0;
  add.s64 %rl0, %rl0, -1;
  shl.b64 %rl0, %rl0, 2;
  ld.param.u64 %rl1, [sum_kernel__param_0];
  add.s64 %rl1, %rl1, %rl0;
  ld.param.u64 %rl2, [sum_kernel__param_1];
  add.s64 %rl2, %rl2, %rl0;
  ld.f32 %f0, [%rl2];
  ld.f32 %f1, [%rl1];
  add.f32 %f0, %f1, %f0;
  ld.param.u64 %rl1, [sum_kernel__param_2];
  add.s64 %rl0, %rl1, %rl0;
  st.f32 [%rl0], %f0;
  ret;
}
http://kernelgen.org/testit/
Please help us improve the quality and usefulness of KernelGen
The code is open source and can easily be compiled into a binary package
Technical plan for Stage 3 (Fall 2012)
Compiler core improvements (by priority):
1 Get rid of code inlining before applying loop analysis with Polly
2 Fix crashes of kernels using CUDA math functions on Kepler
3 Solve problems with compiling large kernels using ptxas
4 Rewrite the GPU-CPU data sharing model more efficiently
5 Replace host-assisted launching of loop kernels with Kepler K20's dynamic parallelism
6 Enable Polly tiling with support for shared memory, loop interchange, and Kepler's warp shuffle
Improve usability:
Create an Ubuntu PPA repository shipping KernelGen compiler binaries
Testing: NPB, PolyBench, COSMO radiation, WRF
Download link for this presentation:
http://kernelgen.org/ncar2012/
Project mailing list:
[email protected]
Thank you!