Previously, there was device-side cudaDeviceSynchronize(), but it can't be used on an RTX 5070 card. I finally found cudaStreamTailLaunch in CDP2, but I don't know how to use it.
I understand that it means putting all the statements after device_child_kernel into a __global__ function and launching that into the tail stream, but in my code those statements are inside a for loop. So could you help me?
I suggest proper formatting of code. A simple set of steps is to edit your post by clicking on the pencil icon below it, then select all the code, then click the </> button at the top of the edit pane, then save your changes.
Please do that now.
Not every transformation that removes cudaDeviceSynchronize() from CDP code will be trivial.
In your case, since you appear to be launching 100 blocks each consisting of 1 thread, and that one thread in each block is launching a child kernel (in a loop), my initial suggestion would be to get rid of CDP altogether, and launch a single kernel of 100 blocks of 9 threads each.
Your parent kernel code is basically trivial. So we can collapse it all to something like:
// a is a device (or managed) allocation of 900 floats (100 blocks x 9 threads)
__global__ void kernel(float *a){
  __shared__ float min_a;                      // block-wide minimum
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  for (int i = 0; i < 20; i++){
    a[idx] = v();                              // v() represents some formula calc
    __syncthreads();                           // make all of this block's values visible
    // find the min of this block's values; thread 0 does the small reduction
    if (threadIdx.x == 0){
      float m = a[blockDim.x*blockIdx.x];
      for (int j = 1; j < blockDim.x; j++)
        m = min(m, a[blockDim.x*blockIdx.x+j]);
      min_a = m;
    }
    __syncthreads();                           // min_a is now visible to all threads
    if (min_a > threshold)                     // threshold is a placeholder value
      break;
    // other device processing
  }
}
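For reference, a minimal host-side harness for the collapsed kernel might look like the sketch below. This is just an assumption of how it would be driven (error checking omitted, and it presumes v() and threshold are defined so the kernel compiles):

int main(){
  const int nblocks = 100, nthreads = 9;       // 100 blocks of 9 threads, as suggested
  float *d_a;
  cudaMalloc(&d_a, nblocks*nthreads*sizeof(float));
  kernel<<<nblocks, nthreads>>>(d_a);          // single launch, no CDP needed
  cudaDeviceSynchronize();                     // host-side sync is still fully supported
  float h_a[900];
  cudaMemcpy(h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost);
  cudaFree(d_a);
  return 0;
}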
Thanks for your reply.
I will try your suggestion of launching a kernel of 100 blocks of 9 threads each.
But about the use of cudaStreamTailLaunch in CDP2: I still don't understand its usage. I understand cudaStreamTailLaunch can't simply replace cudaDeviceSynchronize() in place to get the result of device_child_kernel. I understand that cudaStreamTailLaunch means putting all the statements after device_child_kernel into a __global__ function and launching it into the tail stream.
Is my understanding correct? Thank you.
Yes, tail launch forces the work you put in the tail launch kernel to wait until all (ordinary) parent kernel activity is complete.
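To make that concrete, here is a minimal sketch of the pattern (illustration only, not your exact code): device_child_kernel and continuation are hypothetical names, and the statements that used to follow cudaDeviceSynchronize() move into the continuation kernel. Build with relocatable device code, e.g. nvcc -rdc=true:

#include <cstdio>

__global__ void device_child_kernel(float *a){
  a[threadIdx.x] += 1.0f;                      // some child work
}

// holds the statements that used to come after cudaDeviceSynchronize()
__global__ void continuation(float *a){
  printf("child result is visible here: %f\n", a[0]);
}

__global__ void parent(float *a){
  device_child_kernel<<<1, 9>>>(a);            // ordinary device-side launch

  // tail launch: begins only after this parent grid and all work it
  // launched (including the child above) have completed
  continuation<<<1, 1, 0, cudaStreamTailLaunch>>>(a);
}

int main(){
  float *a;
  cudaMallocManaged(&a, 9*sizeof(float));
  for (int i = 0; i < 9; i++) a[i] = 0.0f;
  parent<<<1, 1>>>(a);
  cudaDeviceSynchronize();                     // host-side synchronization is unaffected
  cudaFree(a);
  return 0;
}

Note that a tail launch only begins once the entire parent grid has finished, so it cannot stand in for a cudaDeviceSynchronize() that sits in the middle of a for loop; that is part of why this kind of refactoring is non-trivial.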
My suggestion is perhaps partly based on the idea that refactoring the cudaDeviceSynchronize() out of the CDP kernel is not always trivial. But I believe a more important consideration is performance.
Launching a kernel with a <<<100,1>>> configuration is not a good path to performance, in my opinion, and I would say the same thing for <<<1,9>>>. (And <<<100,9>>>, although better, is probably still a long way from getting great performance out of a modern GPU.) CDP doesn't generally make such concerns disappear. These statements are not categorical facts, of course. They are just based on my experience, and the examples I have seen and looked at.