Previously, there was device-side cudaDeviceSynchronize(), but it can't be used on an RTX 5070 card. I finally found cudaStreamTailLaunch in CDP2, but I don't know how to use it.
I understand that it means putting all the statements after device_child_kernel into a __global__ function and launching that into the tail stream, but in my code those statements are inside a for loop. So could you help me?
I suggest proper formatting of code. A simple set of steps is to edit your post by clicking on the pencil icon below it, then select all the code, then click the </> button at the top of the edit pane, then save your changes.
Please do that now.
Not every transformation that removes cudaDeviceSynchronize() from CDP code will be trivial.
In your case, since you appear to be launching 100 blocks each consisting of 1 thread, and that one thread in each block is launching a child kernel (in a loop), my initial suggestion would be to get rid of CDP altogether, and launch a single kernel of 100 blocks of 9 threads each.
Your parent kernel code is basically trivial. So we can collapse it all to something like:
// a is a device (or managed) allocation of 900 floats (100 blocks x 9 threads)
__global__ void kernel(float *a){
  __shared__ float min_a;                      // block-wide minimum
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  for (int i = 0; i < 20; i++){
    a[idx] = v();                              // v() represents some formula calc
    __syncthreads();                           // make all of this block's values visible
    // find the min of this block's values; thread 0 does the small reduction
    if (threadIdx.x == 0){
      float m = a[blockDim.x*blockIdx.x];
      for (int j = 1; j < blockDim.x; j++)
        m = min(m, a[blockDim.x*blockIdx.x+j]);
      min_a = m;
    }
    __syncthreads();                           // min_a is now visible to all threads
    if (min_a > threshold)                     // threshold is a placeholder value
      break;
    // other device processing
  }
}
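For reference, a minimal host-side harness for the collapsed kernel might look like the sketch below. This is just an assumption of how it would be driven (error checking omitted, and it presumes v() and threshold are defined so the kernel compiles):

int main(){
  const int nblocks = 100, nthreads = 9;       // 100 blocks of 9 threads, as suggested
  float *d_a;
  cudaMalloc(&d_a, nblocks*nthreads*sizeof(float));
  kernel<<<nblocks, nthreads>>>(d_a);          // single launch, no CDP needed
  cudaDeviceSynchronize();                     // host-side sync is still fully supported
  float h_a[900];
  cudaMemcpy(h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost);
  cudaFree(d_a);
  return 0;
}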
Thanks for your reply.
I will try your suggestion of launching a kernel of 100 blocks of 9 threads each.
But about the use of cudaStreamTailLaunch in CDP2: I still don't understand its usage. I understand cudaStreamTailLaunch can't simply replace cudaDeviceSynchronize() in place to get the result of device_child_kernel. I understand that cudaStreamTailLaunch means putting all the statements after device_child_kernel into a __global__ function and launching it into the tail stream.
Is my understanding correct? Thank you.
Yes, tail launch forces the work you put in the tail launch kernel to wait until all (ordinary) parent kernel activity is complete.
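To make that concrete, here is a minimal sketch of the pattern (illustration only, not your exact code): device_child_kernel and continuation are hypothetical names, and the statements that used to follow cudaDeviceSynchronize() move into the continuation kernel. Build with relocatable device code, e.g. nvcc -rdc=true:

#include <cstdio>

__global__ void device_child_kernel(float *a){
  a[threadIdx.x] += 1.0f;                      // some child work
}

// holds the statements that used to come after cudaDeviceSynchronize()
__global__ void continuation(float *a){
  printf("child result is visible here: %f\n", a[0]);
}

__global__ void parent(float *a){
  device_child_kernel<<<1, 9>>>(a);            // ordinary device-side launch

  // tail launch: begins only after this parent grid and all work it
  // launched (including the child above) have completed
  continuation<<<1, 1, 0, cudaStreamTailLaunch>>>(a);
}

int main(){
  float *a;
  cudaMallocManaged(&a, 9*sizeof(float));
  for (int i = 0; i < 9; i++) a[i] = 0.0f;
  parent<<<1, 1>>>(a);
  cudaDeviceSynchronize();                     // host-side synchronization is unaffected
  cudaFree(a);
  return 0;
}

Note that a tail launch only begins once the entire parent grid has finished, so it cannot stand in for a cudaDeviceSynchronize() that sits in the middle of a for loop; that is part of why this kind of refactoring is non-trivial.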
My suggestion is perhaps partly based on the idea that refactoring the cudaDeviceSynchronize() out of the CDP kernel is not always trivial. But I believe a more important consideration is performance.
Launching a kernel with a <<<100,1>>> configuration is not a good path to performance, in my opinion, and I would say the same thing for <<<1,9>>>. (And <<<100,9>>>, although better, is probably still a long way from getting great performance out of a modern GPU.) CDP doesn't generally make such concerns disappear. These statements are not categorical facts, of course. They are just based on my experience, and the examples I have seen and looked at.