A C++ Toolchain for Your GPU
Joseph Huber
LLVM Developer’s Conference 2024
Introduction — GPGPU
Targeting GPUs — CUDA / HIP
● Ubiquitous for targeting GPUs
● GPU code is manually declared
  ○ __global__ and __device__
● Difficult to integrate into existing build systems
  ○ One compile job yields many files
  ○ Host and device compilations must both compile
● Less portable

    __global__ void saxpy(int n, float a, float *x, float *y) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
        y[i] = a * x[i] + y[i];
    }

    $ clang++ -x hip hip.cpp --offload-arch=gfx940 -###
    -cc1 -triple amdgcn-amd-amdhsa … -fcuda-is-device -x hip

    #include "device_amd_hsa.h" // HIP Runtime
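The index calculation in the saxpy kernel can be checked on an ordinary CPU. The following is a minimal host-side sketch, not CUDA or HIP code: the `Dim` struct and the `saxpy_thread` and `saxpy_launch` functions are hypothetical names introduced here, and the launch is emulated by running every (block, thread) pair sequentially.

```cpp
#include <cassert>

// Hypothetical host-side stand-in for the built-in blockIdx / blockDim /
// threadIdx variables that a CUDA or HIP compiler provides.
struct Dim {
  unsigned x;
};

// Same body as the kernel on the slide, with the indices passed explicitly.
void saxpy_thread(Dim blockIdx, Dim blockDim, Dim threadIdx, int n, float a,
                  const float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) // Threads past the end of the arrays do nothing.
    y[i] = a * x[i] + y[i];
}

// Emulates a one-dimensional launch by visiting each (block, thread) pair
// in sequence. A real GPU runs these concurrently.
void saxpy_launch(unsigned block_size, int n, float a, const float *x,
                  float *y) {
  assert(block_size > 0 && n >= 0);
  unsigned blocks = (n + block_size - 1) / block_size; // Round up.
  for (unsigned b = 0; b < blocks; ++b)
    for (unsigned t = 0; t < block_size; ++t)
      saxpy_thread({b}, {block_size}, {t}, n, a, x, y);
}
```

With `n = 5` and a block size of 4, two blocks are launched and the last three threads of the second block fail the `i < n` guard, which is why the guard is needed.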
Targeting GPUs — OpenMP
● Uses C++ with compiler pragmas
  ○ #pragma omp declare target
● More “standard” C++
● Same issues with build systems
● Very portable
● Compiled by the clang frontend
● Uses the same builtins for the runtime

    void saxpy(int n, float a, float *x, float *y) {
    #pragma omp target teams distribute parallel for
      for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
    }

    $ clang++ -x cpp openmp.cpp -fopenmp --offload-arch=gfx940 -###
    -cc1 -triple amdgcn-amd-amdhsa … -fopenmp-is-target-device

    #include <Mapping.h> // OpenMP Runtime
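The OpenMP version operates on raw pointers, which carry no extent information. Unless unified shared memory is in use, the runtime needs to be told how many elements to copy. A minimal sketch with explicit `map` clauses follows; the clauses and the name `saxpy_mapped` are additions for illustration and do not appear on the slide.

```cpp
#include <cassert>

// saxpy with explicit data mapping: x is copied to the device, y is copied
// to the device and back. When compiled without -fopenmp the pragma is
// ignored and the loop runs serially on the host.
void saxpy_mapped(int n, float a, const float *x, float *y) {
  assert(n >= 0);
#pragma omp target teams distribute parallel for map(to : x[0:n]) map(tofrom : y[0:n])
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}
```

The same source produces the same result with or without offloading enabled, which is part of what makes the pragma-based approach portable.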
Targeting GPUs — OpenCL
Targeting GPUs — C/C++
Targeting GPUs — ISO C/C++
Targeting GPUs — C/C++ Extensions
[[clang::amdgpu_kernel]] void matmul(float *A, float *B, float *C, int N) { // Target calling convention
static [[clang::address_space(3)]] float A_s[TILE][TILE]; // Target address space for __shared__
static [[clang::address_space(3)]] float B_s[TILE][TILE]; // Target address space for __shared__
Cross Compiling — C/C++
LLVM Runtimes — Introduction
compiler-rt
libcxxabi
offload
openmp
libcxx
libc
Clang/LLVM — Multilibs
• Each runtime gets its own directory
• -DLLVM_ENABLE_PER_TARGET_RUNTIME_DIR=ON
• Use the GPU target to create the toolchain
• Clang will point to the appropriate folder
• Only need to pass -lm -lc …
• Now let’s actually build them

🗀 install
├── 🗀 bin
│   ├── amdhsa-loader
│   └── clang
├── 🗀 include
│   ├── 🗀 amdgcn-amd-amdhsa
│   │   ├── 🗀 c++
│   │   │   └── 🗀 v1
│   │   │       └── __config_site
│   │   └── <libc headers>
│   └── 🗀 c++
│       └── 🗀 v1
│           └── <libc++ headers>
└── 🗀 lib
    ├── 🗀 amdgcn-amd-amdhsa
    │   ├── crt1.o
    │   ├── libc++.a
    │   ├── libc.a
    │   ├── libc++abi.a
    │   └── libm.a
    └── 🗀 clang
        └── 🗀 20
            └── 🗀 lib
                └── 🗀 amdgcn-amd-amdhsa
                    └── libclang_rt.builtins.a
LLVM Runtimes — LLVM libc
• Make the GPU look like a normal hosted target
• Standard libc implementations use a startup object (e.g. crt1.o) to call the main function
• Just write one for the GPU
• Cross-compiling emulators run tests
• Write one for the GPU using the GPU runtime

    void call_init_callbacks(int argc, char **argv, char **env) {
      /* Call global constructors. */
    }
    void call_fini_callbacks() { /* Call global destructors. */ }

    extern "C" {
    [[gnu::visibility("protected"), clang::amdgpu_kernel]] void
    _begin(int argc, char **argv, char **env, void *in, void *out) {
      atexit(&call_fini_callbacks);
      call_init_callbacks(argc, argv, env);
    }
    } // extern "C"
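The ordering that a startup object enforces (constructors, then main, then destructors) can be sketched on the host. This is an illustration under stated assumptions, not the LLVM libc implementation: `CallbackTable`, `run_init`, `run_fini`, and `run_program` are hypothetical names, the tables stand in for the linker-provided init and fini arrays, and the sketch collapses the sequence into one function where the slide's `_begin` kernel registers the destructors with `atexit`.

```cpp
#include <cassert>

using Callback = void (*)();

// Hypothetical stand-in for a linker-provided array of callbacks,
// described by a [begin, end) pointer pair.
struct CallbackTable {
  const Callback *begin;
  const Callback *end;
};

// Global constructors run in array order.
void run_init(CallbackTable init) {
  for (const Callback *cb = init.begin; cb != init.end; ++cb)
    (*cb)();
}

// Global destructors run in reverse array order.
void run_fini(CallbackTable fini) {
  for (const Callback *cb = fini.end; cb != fini.begin;)
    (*--cb)();
}

// Runs constructors, then the entry point, then destructors, and returns
// the entry point's exit code.
int run_program(CallbackTable init, CallbackTable fini,
                int (*entry)(int, char **), int argc, char **argv) {
  assert(entry != nullptr);
  run_init(init);
  int ret = entry(argc, argv);
  run_fini(fini);
  return ret;
}
```

On the GPU, a loader running on the host plays the role of the operating system: it launches the startup kernel and collects the return value.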
LLVM Runtimes — Compiler-RT
$> clang app.c --target=amdgcn-amd-amdhsa -flto -mcpu=native -lc -lm -lclang_rt.builtins crt1.o -nogpulib
$> amdhsa-loader --blocks 3 ./a.out
./a.out 1
./a.out 0
./a.out 2
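The source of `app.c` is not shown, but the output is consistent with each of the three blocks printing the program name followed by its block index. A sketch of the two pieces such a program might need is below. It assumes the `__AMDGPU__` target macro and the `__builtin_amdgcn_workgroup_id_x` builtin provided by clang for AMD GPU targets; `block_id` and `format_line` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Returns the index of the current block (workgroup) in the x dimension.
// On a non-GPU build there is only one "block", so this returns 0.
unsigned block_id() {
#if defined(__AMDGPU__)
  return __builtin_amdgcn_workgroup_id_x();
#else
  return 0;
#endif
}

// Writes "<program> <id>\n" into buf and returns the number of characters
// the full line requires, as reported by snprintf.
int format_line(char *buf, std::size_t size, const char *program,
                unsigned id) {
  assert(buf != nullptr && program != nullptr);
  return std::snprintf(buf, size, "%s %u\n", program, id);
}
```

A `main` that formats `argv[0]` with `block_id()` and prints the result would produce one line per block. Blocks run concurrently, so the lines can appear in any order, as in the output above.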
LLVM Runtimes — libc++
LLVM Runtimes — libc++ example
#include <...>
std::vector<int> vec(8);
std::ranges::generate(vec, [&]() { return dist(generator); });
$> clang++ app.cpp --target=amdgcn-amd-amdhsa -flto -mcpu=native -lc -lm -lc++ -lc++abi \
crt1.o -lclang_rt.builtins -stdlib=libc++ -nogpulib -fno-exceptions
$> amdhsa-loader ./a.out
45 48 65 68 68 10 84 22
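The fragment above elides its includes and the declarations of `dist` and `generator`. One self-contained reading is sketched below, assuming a `std::mt19937` engine and a `std::uniform_int_distribution<int>` over [0, 99]; the engine, the range, the seed parameter, and the function name `random_values` are all assumptions, so the values will not match the output shown. The `std::generate` fallback only exists so the sketch also builds before C++20.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Fills an 8-element vector with pseudo-random integers in [0, 99].
// The same seed always yields the same sequence on a given implementation.
std::vector<int> random_values(unsigned seed) {
  std::mt19937 generator(seed);
  std::uniform_int_distribution<int> dist(0, 99);
  assert(dist.min() == 0 && dist.max() == 99);

  std::vector<int> vec(8);
#if defined(__cpp_lib_ranges)
  // C++20: the range overload used in the fragment above.
  std::ranges::generate(vec, [&]() { return dist(generator); });
#else
  // Pre-C++20 fallback with the iterator-pair overload.
  std::generate(vec.begin(), vec.end(), [&]() { return dist(generator); });
#endif
  return vec;
}
```

Nothing in this function is GPU-specific, which is the point of the slide: with libc++ built for the GPU target, unmodified standard C++ compiles and runs there.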
LLVM Runtimes — Testing libc++
LLVM Runtimes — Bringing to Offloading Languages
#include <iostream>
int main() {
#pragma omp target
std::cout << "Hello World\n";
}
LLVM Runtimes — Bringing it all Together
Challenges
Running DOOM on the GPU
https://github.com/jhuber6/doomgeneric
DOOM — Demo