
A C/C++ Toolchain for your GPU
Joseph Huber ([email protected])
LLVM Developer’s Conference 2024
Introduction — GPGPU

● Modern GPUs have evolved into general-purpose accelerators
● Goal: provide a C/C++ toolchain that runs on the GPU
  ○ Trivially port applications to run on the GPU
  ○ Generic implementations of math / utility functions
  ○ Run unit tests on the GPU
  ○ It’s also fun
● How do we port existing C/C++ libraries to the GPU?
  ○ Lots of existing options

Targeting GPUs — CUDA / HIP

● Ubiquitous for targeting GPUs
● GPU code is manually declared
  ○ __global__ and __device__
● Difficult to integrate into existing build systems
  ○ One compile job yields many files
  ○ Host and device compilations must both compile
● Less portable
● Compiled by the clang frontend
● Runtime implemented by builtins

__global__ void saxpy(int n, float a, float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a * x[i] + y[i];
}

$ clang++ -x hip hip.cpp --offload-arch=gfx940 -###
  -cc1 -triple amdgcn-amd-amdhsa … -fcuda-is-device -x hip

#include "device_amd_hsa.h" // HIP Runtime

ATTR size_t __ockl_get_local_id(uint dim) {
  switch (dim) {
  case 0: return __builtin_amdgcn_workitem_id_x();
  case 1: return __builtin_amdgcn_workitem_id_y();
  case 2: return __builtin_amdgcn_workitem_id_z();
  }
}
Targeting GPUs — OpenMP

● Uses C++ with compiler pragmas
  ○ #pragma omp declare target
● More “standard” C++
● Same issues with build systems
● Very portable
● Compiled by the clang frontend
● Uses the same builtins for the runtime

void saxpy(int n, float a, float *x, float *y) {
#pragma omp target teams distribute parallel for
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

$ clang++ -x cpp openmp.cpp -fopenmp --offload-arch=gfx940 -###
  -cc1 -triple amdgcn-amd-amdhsa … -fopenmp-is-target-device

#include <Mapping.h> // OpenMP Runtime

uint32_t getThreadIdInBlock(int32_t Dim) {
  switch (Dim) {
  case 0: return __builtin_amdgcn_workitem_id_x();
  case 1: return __builtin_amdgcn_workitem_id_y();
  case 2: return __builtin_amdgcn_workitem_id_z();
  }
}
Targeting GPUs — OpenCL

● Uses a more conventional approach
  ○ Easier to fit into existing builds
● OpenCL is fundamentally limited
  ○ No function pointers
  ○ No recursion
  ○ Conflicting definitions of C functions
  ○ OpenCL C++ has templates at least
● Also compiled through the clang frontend
● The runtime uses the same builtin functions
  ○ You get the idea

__kernel void
saxpy(int n, float a, __global float *x, __global float *y) {
  int i = get_global_id(0);
  if (i < n) y[i] = a * x[i] + y[i];
}

$ clang++ -x cl opencl.cl --target=amdgcn-- -mcpu=gfx940
  -cc1 -triple amdgcn-amd-amdhsa … -x cl

#include <clc/clc.h> // OpenCL Runtime

_CLC_DEF _CLC_OVERLOAD size_t get_local_id(uint dim) {
  switch (dim) {
  case 0: return __builtin_amdgcn_workitem_id_x();
  case 1: return __builtin_amdgcn_workitem_id_y();
  case 2: return __builtin_amdgcn_workitem_id_z();
  }
}
Targeting GPUs — C/C++

● Why bother porting anything in the first place?
● All of these languages are just different versions of clang
  ○ We can target C/C++ directly without a GPU language
● Use --target=amdgcn-amd-amdhsa to invoke the clang target
● Our linker is ld.lld, so we can use LTO and static libraries
  ○ Don’t specify -mcpu= and we get generic LLVM-IR
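A minimal sketch of that direct invocation, assuming a source file named kernel.cpp (the file name is illustrative; the flags mirror ones used later in this deck):

$ clang++ --target=amdgcn-amd-amdhsa -flto -c kernel.cpp -o kernel.o  # no -mcpu=, so the object stays generic LLVM-IR
$ llvm-ar rcs libkernel.a kernel.o                                    # archive it like any ordinary static library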

Targeting GPUs — ISO C/C++

void matmul(float *A, float *B, float *C, int N) {
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
      float sum = 0.0;
      for (int k = 0; k < N; ++k)
        sum += A[i * N + k] * B[k * N + j];
      C[i * N + j] = sum;
    }
  }
}

Targeting GPUs — C/C++ Extensions

#define TILE 16 // Tile size; value assumed here, not shown on the slide

[[clang::amdgpu_kernel]] void matmul(float *A, float *B, float *C, int N) { // Target calling convention
  static [[clang::address_space(3)]] float A_s[TILE][TILE]; // Target address space for __shared__
  static [[clang::address_space(3)]] float B_s[TILE][TILE]; // Target address space for __shared__

  int bx = __builtin_amdgcn_workgroup_id_x(); int tx = __builtin_amdgcn_workitem_id_x(); // Builtin functions
  int by = __builtin_amdgcn_workgroup_id_y(); int ty = __builtin_amdgcn_workitem_id_y(); // Builtin functions

  float sum = 0.0f; // Hoisted out of the loop so the final store can see it
  for (int ph = 0; ph < N / TILE; ++ph) {
    A_s[ty][tx] = A[(by * TILE + ty) * N + ph * TILE + tx];
    B_s[ty][tx] = B[(ph * TILE + ty) * N + bx * TILE + tx];
    __builtin_amdgcn_s_barrier(); // Builtin functions
    for (int k = 0; k < TILE; ++k)
      sum += A_s[ty][k] * B_s[k][tx];
    __builtin_amdgcn_s_barrier(); // Builtin functions
  }
  C[(by * TILE + ty) * N + bx * TILE + tx] = sum;
}

Cross Compiling — C/C++

● This is just cross compiling!
  ○ Plenty of build systems support this natively (a toolchain-file sketch follows below)
● A complete compiler toolchain has…
  ○ A compiler frontend ✓
  ○ An assembler ✓
  ○ A linker ✓
  ○ Runtime libraries ✗
● Let’s port the LLVM C and C++ runtimes to the GPU!
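As a sketch of the build-system side, a CMake toolchain file for this target could look roughly like the following; the file name and values are illustrative assumptions, not taken from the talk:

# gpu-toolchain.cmake: hypothetical toolchain file for cross compiling to the GPU
set(CMAKE_SYSTEM_NAME Generic)                  # No host OS on the device
set(CMAKE_C_COMPILER clang)
set(CMAKE_CXX_COMPILER clang++)
set(CMAKE_C_COMPILER_TARGET amdgcn-amd-amdhsa)  # Same triple as --target= above
set(CMAKE_CXX_COMPILER_TARGET amdgcn-amd-amdhsa)

It would then be passed as -DCMAKE_TOOLCHAIN_FILE=gpu-toolchain.cmake like any other cross build.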

LLVM Runtimes — Introduction

• Bootstraps multiple libraries using the just-built clang
• Creates the libraries for multiple targets
  • -DLLVM_RUNTIME_TARGETS=default;amdgcn-amd-amdhsa;nvptx64-nvidia-cuda
• CMake arguments can be passed individually to each runtime target
  • -DRUNTIMES_<triple>_<var>=<value> (instantiated below)
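For example, choosing which runtimes are built for just the AMDGPU target (this is the same variable the full invocation near the end of this deck passes):

-DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES="libc;compiler-rt;libcxx;libcxxabi"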
[Diagram: one LLVM runtimes build fans out to three targets: x86-64-pc-linux-gnu (compiler-rt, libcxxabi, offload, openmp, libcxx, libc), amdgcn-amd-amdhsa (libc), and nvptx64-nvidia-cuda (libc).]
Clang/LLVM — Multilibs

• Each runtime gets its own directory
  • -DLLVM_ENABLE_PER_TARGET_RUNTIME_DIR=ON
• Use the GPU target to create the toolchain
• Clang will point to the appropriate folder
  • Only need to pass -lm -lc …
• Now let’s actually build them

🗀 install
├── 🗀 bin
│   ├── amdhsa-loader
│   └── clang
├── 🗀 include
│   ├── 🗀 amdgcn-amd-amdhsa
│   │   ├── 🗀 c++
│   │   │   └── 🗀 v1
│   │   │       └── __config_site
│   │   └── <libc headers>
│   └── 🗀 c++
│       └── 🗀 v1
│           └── <libc++ headers>
└── 🗀 lib
    ├── 🗀 amdgcn-amd-amdhsa
    │   ├── crt1.o
    │   ├── libc++.a
    │   ├── libc.a
    │   ├── libc++abi.a
    │   └── libm.a
    └── 🗀 clang
        └── 🗀 20
            └── 🗀 lib
                └── 🗀 amdgcn-amd-amdhsa
                    └── libclang_rt.builtins.a

LLVM Runtimes — LLVM libc

• The C library is the basis of other applications
• The LLVM C library is highly configurable
  • Just enable the functions we want and compile them
• System calls are handled through Remote Procedure Calls
  • Client / server protocol communicating through mutual exclusion on unified memory (a minimal sketch follows below)
• See my talk last year, “The LLVM C Library for GPUs”, for more detailed information

set(TARGET_LIBC_ENTRYPOINTS
    # stdio.h entrypoints
    libc.src.stdio.printf
    libc.src.stdio.vprintf
    libc.src.stdio.fprintf
    libc.src.stdio.vfprintf
    libc.src.stdio.snprintf
    libc.src.stdio.sprintf
    libc.src.stdio.vsnprintf
    libc.src.stdio.vsprintf
    libc.src.stdio.asprintf
    libc.src.stdio.vasprintf
    libc.src.stdio.sscanf
    libc.src.stdio.vsscanf
    libc.src.stdio.fscanf
)
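To make the RPC idea concrete, here is a minimal sketch of a single mailbox shared over unified memory; the names, states, and layout are illustrative assumptions, not LLVM libc's actual protocol:

#include <stdatomic.h>
#include <string.h>

enum { FREE, CLAIMED, REQUEST, REPLY }; /* mailbox ownership states */

typedef struct {
  _Atomic int state;  /* who owns the mailbox right now */
  int opcode;         /* hypothetical operation id, e.g. "write to stdout" */
  char payload[256];  /* argument / result buffer */
} mailbox_t;

/* GPU client: claim the mailbox, publish a request, spin for the reply. */
void client_call(mailbox_t *box, int op, const char *msg) {
  int expected = FREE;
  while (!atomic_compare_exchange_weak(&box->state, &expected, CLAIMED))
    expected = FREE;                   /* mutual exclusion over the mailbox */
  box->opcode = op;
  strncpy(box->payload, msg, sizeof(box->payload) - 1);
  atomic_store_explicit(&box->state, REQUEST, memory_order_release);
  while (atomic_load_explicit(&box->state, memory_order_acquire) != REPLY)
    ;                                  /* wait for the host server */
  atomic_store_explicit(&box->state, FREE, memory_order_release);
}

/* Host server: poll for a request, perform the real system call, reply. */
void server_poll(mailbox_t *box) {
  if (atomic_load_explicit(&box->state, memory_order_acquire) == REQUEST) {
    /* ...perform the syscall described by box->opcode and box->payload... */
    atomic_store_explicit(&box->state, REPLY, memory_order_release);
  }
}

The real implementation is more involved (it services many parallel lanes at once); this sketch only shows the mutual-exclusion handshake the bullet above describes.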

LLVM Runtimes — LLVM libc

• Make the GPU look like a normal hosted target
• Standard libc implementations use a startup object (i.e. crt1.o) to call the main function
  • Just write one for the GPU
• Cross compiling emulators run tests
  • Write one for the GPU using the GPU runtime

void call_init_callbacks(int argc, char **argv, char **env) {
  /* Call global constructors. */
}
void call_fini_callbacks() { /* Call global destructors. */ }

extern "C" {
[[gnu::visibility("protected"), clang::amdgpu_kernel]] void
_begin(int argc, char **argv, char **env, void *in, void *out) {
  atexit(&call_fini_callbacks);
  call_init_callbacks(argc, argv, env);
}

[[gnu::visibility("protected"), clang::amdgpu_kernel]] void
_start(int argc, char **argv, char **envp, int *ret) {
  __atomic_fetch_or(ret, main(argc, argv, envp), __ATOMIC_RELAXED);
}

[[gnu::visibility("protected"), clang::amdgpu_kernel]] void
_end(int retval) {
  exit(retval);
}
}

LLVM Runtimes — Compiler-RT

• Provides a runtime library that Clang will implicitly call
  • libclang_rt.builtins.a
• Porting is very straightforward
  • Just enable the runtimes build and set a few flags
• This gives us a functional C toolchain for the GPU
  • Now we can compile some more interesting things for the GPU

#include <stdio.h>

int main(int argc, char **argv) {
  fprintf(stdout, "%s %d\n", argv[0], __builtin_amdgcn_workgroup_id_x());
}

$> clang app.c --target=amdgcn-amd-amdhsa -flto -mcpu=native -lc -lm -lclang_rt.builtins crt1.o -nogpulib
$> amdhsa-loader --blocks 3 ./a.out
./a.out 1
./a.out 0
./a.out 2

LLVM Runtimes — libc++

• Build the C++ libraries on top of the C toolchain
• Disable some things we can’t handle right now
  • Threads and Filesystem support, mostly (see the sketch below)
• Lots of config options, so we provide a cache file
  • -DRUNTIMES_amdgcn-amd-amdhsa_CACHE_FILES=libcxx/cmake/caches/AMDGPU.cmake
• Other approaches exist but typically require a separate include path
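For a sense of what such a cache file turns off, here are a couple of illustrative entries (assumed for this sketch, not quoted from the actual AMDGPU.cmake):

set(LIBCXX_ENABLE_THREADS OFF CACHE BOOL "")     # no <thread> support yet
set(LIBCXX_ENABLE_FILESYSTEM OFF CACHE BOOL "")  # no <filesystem> support yet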

LLVM Runtimes — libc++ example

#include <algorithm>
#include <iostream>
#include <random>
#include <vector>

int main(int argc, char **argv) {
  std::mt19937 generator(__builtin_amdgcn_workitem_id_x());
  std::uniform_int_distribution<int> dist(1, 100);

  std::vector<int> vec(8);
  std::ranges::generate(vec, [&]() { return dist(generator); });

  for (int x : vec)
    std::cout << x << " ";
}

$> clang++ app.cpp --target=amdgcn-amd-amdhsa -flto -mcpu=native -lc -lm -lc++ -lc++abi \
   crt1.o -lclang_rt.builtins -stdlib=libc++ -nogpulib -fno-exceptions
$> amdhsa-loader ./a.out
45 48 65 68 68 10 84 22

LLVM Runtimes — Testing libc++

• libc++ already has support for custom test executors
• We can run the test suite on the GPU for free
• Found a lot of obscure compiler bugs
• Still a lot of failures, but it’s a start

$> ninja -C runtimes/runtimes-amdgcn-amd-amdhsa-bins check-cxx

Testing Time: 1092.29s
Total Discovered Tests: 9749
  Unsupported      : 2743 (28.14%)
  Passed           : 6804 (69.79%)
  Expectedly Failed:   48 (0.49%)
  Failed           :  154 (1.58%)

LLVM Runtimes — Bringing to Offloading Languages

• Must specify which definitions are available to the device
• We export a static library in the target-specific directory
  • Just link it with the device-side compilation; the compiler knows where to find it
• All these languages use the same linker and toolchain, remember?

#include <iostream>

#pragma omp declare target to(std::cout)

int main() {
#pragma omp target
  std::cout << "Hello World\n";
}

$> clang++ hello.cpp -fopenmp --offload-arch=native -Xoffload-linker -lc++ \
   -Xoffload-linker -lc++abi -Xoffload-linker -lc -fno-exceptions -stdlib=libc++
$> ./a.out
Hello World

LLVM Runtimes — Bringing it all Together

• We can build a functional C/C++ toolchain that targets the GPU
• The magic spell to summon it:

$> cd llvm-project # The llvm-project checkout
$> mkdir build
$> cd build
$> cmake ../llvm -G Ninja \
     -DLLVM_ENABLE_PROJECTS="clang;lld" \
     -DCMAKE_BUILD_TYPE=<Debug|Release> \ # Select build type
     -DCMAKE_INSTALL_PREFIX=<PATH> \ # Where the libraries will live
     -DRUNTIMES_amdgcn-amd-amdhsa_CACHE_FILES="./libcxx/cmake/caches/AMDGPU.cmake" \
     -DRUNTIMES_nvptx64-nvidia-cuda_CACHE_FILES="./libcxx/cmake/caches/NVPTX.cmake" \
     -DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES="libc;compiler-rt;libcxx;libcxxabi" \
     -DRUNTIMES_nvptx64-nvidia-cuda_LLVM_ENABLE_RUNTIMES="libc;compiler-rt;libcxx;libcxxabi" \
     -DLLVM_RUNTIME_TARGETS="default;amdgcn-amd-amdhsa;nvptx64-nvidia-cuda"
$> ninja install

Challenges

• How to handle <mutex> and <thread>?
  • GPUs have limited forward progress guarantees
  • Cooperative launches can probably help
• Compile times!
  • Really bad, because we need LTO for both performance and portability
  • HIP/CUDA/OpenMP must compile the same headers on the CPU and GPU
• Things must be explicitly declared on the GPU
  • What if we copy something from the CPU to the GPU?
  • std::vector calls realloc on the GPU: instant segfault
  • std::mutex calls futex: the GPU cannot notify the thread
• Not as many useful functions in libc++
  • Huge resource requirements when compared to libc
• Should we provide exceptions?
  • NVIDIA is limited by PTX and nvlink ¯\_(ツ)_/¯
• …

Running DOOM on the GPU
https://github.com/jhuber6/doomgeneric
DOOM — Demo


Disclaimer
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and
typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but
not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product
differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has
risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct
or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content
hereof without obligation of AMD to notify any person of such revisions or changes.

THIS INFORMATION IS PROVIDED “AS IS.” AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE
CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY
APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR
ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY
INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

© 2024 Advanced Micro Devices, Inc. All rights reserved.


AMD, the AMD Arrow logo, EPYC, Instinct and combinations thereof are trademarks of Advanced Micro Devices, Inc. PCIe is a registered
trademark of PCI-SIG Corporation. Other product names used in this publication are for identification purposes only and may be trademarks
of their respective companies.
