
Concurrent Kernel in OpenCL

To execute two kernels concurrently using the same memory in OpenCL: 1) Enqueue both kernels on a concurrent (out-of-order) command queue using clEnqueueNDRangeKernel. 2) Pass the event objects from each kernel to the buffer read/map calls that read the results on the host. 3) Be aware that not all hardware supports true concurrent execution, so the kernels may be serialized. 4) For simple element-wise operations, consider a float2 vector data type in a single kernel for efficiency.

Uploaded by

sdancer75

5/20/2021 concurrent kernel in OpenCL

concurrent kernel in OpenCL


[0] [1] hanarce
[2018-11-16 15:58:11]
[ memory kernel opencl ]
[ https://stackoverflow.com/questions/53341388/concurrent-kernel-in-opencl ]

I would like to know how I can execute two or more different kernels in parallel, at the same time, on the same GPU using OpenCL. My main idea is to use two
different kernels (kernel A and kernel B), but they need to use the same memory (I do not want to duplicate the memory by using one buffer per kernel for the “a” and “b” pointers).
So is there another way I can accomplish the dual execution with an efficient memory technique? The code of the kernels is the following:

Kernel A:

__kernel void kernelA(global struct VectorStruct* a, int aLen0, global struct VectorStruct* b, int bLen0, global struct VectorStruct* c, int cLen0) {
    int i = get_local_id(0);
    c[(i)].x = a[(i)].x + b[(i)].x;
}

Kernel B:

__kernel void kernelB(global struct VectorStruct* a, int aLen0, global struct VectorStruct* b, int bLen0, global struct VectorStruct* d, int cLen0) {
    int i = get_local_id(0);
    d[(i)].y = a[(i)]

The definition for the struct VectorStruct is the following:

struct VectorStruct { int x; int y; };

In the host code I have to create four pointers: VectorStruct* a, VectorStruct* b, VectorStruct* c and VectorStruct* d. The pointers “a” and “b” hold the data that I will transfer to the GPU.
The pointer “c” will store the results of kernel A, and the pointer “d” will store the results of kernel B.

[0] [2018-11-16 16:29:50] pmdj [ ACCEPTED]

You can enqueue your 2 kernels with clEnqueueNDRangeKernel() on a concurrent command queue, i.e. one where CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE was passed to
clCreateCommandQueue() [1]. Then pass both resulting event objects to the buffer read or map call that reads the result back on the host. Note that not all hardware and OpenCL implementations
support concurrent execution of different kernels, so they may end up being serialised to some extent after all.
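
A minimal host-side sketch of this approach is shown here. It assumes the OpenCL 1.2 C API, and that the context, device, both kernels (with their arguments already set) and the output buffers were created elsewhere. The helper name run_both_kernels and its parameter names are hypothetical and do not come from the question.

```c
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>   /* on macOS the header is <OpenCL/opencl.h> */
#include <stddef.h>

struct VectorStruct { int x; int y; };

/* Hypothetical helper: runs kernelA and kernelB on one out-of-order queue
   and reads their results back into c and d (n elements each).
   Assumes kernel arguments were already set with clSetKernelArg(). */
cl_int run_both_kernels(cl_context ctx, cl_device_id dev,
                        cl_kernel kernelA, cl_kernel kernelB,
                        cl_mem bufC, cl_mem bufD,
                        struct VectorStruct *c, struct VectorStruct *d,
                        size_t n)
{
    cl_int err;
    cl_event doneA = NULL, doneB = NULL;

    /* Fails with CL_INVALID_QUEUE_PROPERTIES if the device does not
       support out-of-order execution. */
    cl_command_queue q = clCreateCommandQueue(
        ctx, dev, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
    if (err != CL_SUCCESS)
        return err;

    /* Neither kernel waits on the other, so the runtime may overlap them. */
    err = clEnqueueNDRangeKernel(q, kernelA, 1, NULL, &n, NULL, 0, NULL, &doneA);
    if (err == CL_SUCCESS)
        err = clEnqueueNDRangeKernel(q, kernelB, 1, NULL, &n, NULL, 0, NULL, &doneB);

    /* On an out-of-order queue each read must wait explicitly on the
       event of the kernel that produced its buffer. */
    if (err == CL_SUCCESS)
        err = clEnqueueReadBuffer(q, bufC, CL_TRUE, 0, n * sizeof *c, c,
                                  1, &doneA, NULL);
    if (err == CL_SUCCESS)
        err = clEnqueueReadBuffer(q, bufD, CL_TRUE, 0, n * sizeof *d, d,
                                  1, &doneB, NULL);

    if (doneA) clReleaseEvent(doneA);
    if (doneB) clReleaseEvent(doneB);
    clReleaseCommandQueue(q);
    return err;
}
```

Both kernels read the same “a” and “b” buffers here, so the input memory is not duplicated. Whether the two kernels actually overlap is still up to the implementation.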

You can also achieve something similar with multiple serial command queues.

For your simple kernel it may be better to use a float2 to represent your vector and perform a vectorised (SIMD) addition in a single kernel. The OpenCL compiler should pick up on the vector
operations and distribute the operations across the parallel hardware automatically.
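
The single-kernel idea can be tried out on the host before writing any OpenCL. The sketch below rests on several assumptions: it uses int2 rather than float2 because the question's struct holds ints, the GCC/Clang vector_size extension stands in for OpenCL's built-in int2 type, and a plain loop stands in for the work-items. The function names are hypothetical.

```c
#include <stddef.h>

/* Stand-in for OpenCL's built-in int2 (GCC/Clang vector extension). */
typedef int int2 __attribute__((vector_size(2 * sizeof(int))));

/* Body of the hypothetical kernel for the work-item with global id gid.
   In OpenCL C the same statement would be: c[gid] = a[gid] + b[gid]; */
void add_xy_item(const int2 *a, const int2 *b, int2 *c, size_t gid)
{
    c[gid] = a[gid] + b[gid];   /* adds x and y in one vector operation */
}

/* Host-side loop standing in for a 1-D NDRange of n work-items. */
void add_xy(const int2 *a, const int2 *b, int2 *c, size_t n)
{
    for (size_t gid = 0; gid < n; ++gid)
        add_xy_item(a, b, c, gid);
}
```

Note that this variant writes both sums into one output array, whereas the question stores x results in “c” and y results in “d”.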

For slightly more complicated operations where this doesn't work so well, you could represent the vector's x and y coordinates as a 2-element array, and simply enqueue twice the number of
work-items on one kernel that works on alternating dimensions.
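
The index mapping for this variant can be checked with a host-side emulation. The sketch assumes the vectors are stored as flat int arrays with x and y interleaved (x0, y0, x1, y1, ...), and a loop over 2*n ids stands in for the work-items. The function names are hypothetical.

```c
#include <stddef.h>

/* One work-item of the hypothetical kernel. gid runs over 2*n ints:
   even ids handle an x component, odd ids handle a y component. */
void add_component(const int *a, const int *b, int *c, size_t gid)
{
    size_t elem = gid / 2;          /* which vector */
    size_t dim  = gid % 2;          /* 0 = x, 1 = y */
    size_t idx  = 2 * elem + dim;   /* always equals gid */
    c[idx] = a[idx] + b[idx];
}

/* Host-side loop standing in for a 1-D NDRange of 2*n work-items. */
void add_interleaved(const int *a, const int *b, int *c, size_t n)
{
    for (size_t gid = 0; gid < 2 * n; ++gid)
        add_component(a, b, c, gid);
}
```

Because idx equals gid, neighbouring work-items touch neighbouring ints, which is the access pattern the answer is aiming for.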

Both approaches will give you much more efficient memory access patterns.

Note that your use of get_local_id(0) might be erroneous, depending on what you want to achieve - you probably want to be using get_global_id(0) in this case.
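
The difference between the two ids can be illustrated with a small host-side emulation. It assumes a 1-D range with no global offset, where the global id equals group id times local size plus local id, and a global size that is a multiple of the local size. The function name and the use_global_id flag are hypothetical.

```c
#include <stddef.h>

/* Emulates the question's x-addition over a 1-D range, indexing either by
   the local id (as in the question) or by the global id (as suggested). */
void add_x_emulated(const int *a, const int *b, int *c,
                    size_t global_size, size_t local_size, int use_global_id)
{
    for (size_t group = 0; group < global_size / local_size; ++group) {
        for (size_t local_id = 0; local_id < local_size; ++local_id) {
            size_t global_id = group * local_size + local_id;
            size_t i = use_global_id ? global_id : local_id;
            c[i] = a[i] + b[i];
        }
    }
}
```

With 8 work-items in groups of 4, indexing by local id only ever writes elements 0 to 3 (each of them twice), while indexing by global id covers all 8 elements.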

[1] https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clCreateCommandQueue.html