Argument list in CUDA kernel?

Dear LB community,

I am writing my first LBM code using CUDA on Win7 x64 with MS-VS 2008.

Thanks to Dr. Tolke for a well-written article on using CUDA kernels for the LBM.

Is there any limit on the number of arguments I can pass to a CUDA kernel? I am trying to pass 43 arguments, each of size N*sizeof(float), where N is 6561.

cuda_kernel<<<nblocks, block_size>>>(arg1, arg2, arg3, arg4, …, argn);

My code finishes successfully, but at the end the following error messages appear in the debug output window at the bottom of MS-VS 2008, and the results are not good.

First-chance exception at 0x000007fefd68aa7d in test.exe: Microsoft C++ exception: cudaError_enum at memory location 0x002bdc08…
First-chance exception at 0x000007fefd68aa7d in test.exe: Microsoft C++ exception: cudaError_enum at memory location 0x002bdc08…

The code works fine when I pass 40 arguments of size N*sizeof(float), with N = 6561.

I would appreciate any input to help me understand and, hopefully, solve this problem.

Thanks
Shadab

Following is my CUDA kernel call:
int frame_rate, N=81*81;

size_t size = N*sizeof(float);

int blockSize = 512;
int nBlocks = N/blockSize + (N%blockSize == 0?0:1);

LBM_comp <<< nBlocks, blockSize >>> (rho0_d, ux0_d, uy0_d, rho1_d, ux1_d, uy1_d, N,
f0_d, f1_d, f2_d, f3_d, f4_d, f5_d, f6_d, f7_d, f8_d, ftemp0_d, ftemp1_d, ftemp2_d, ftemp3_d, ftemp4_d, ftemp5_d, ftemp6_d, ftemp7_d, ftemp8_d,
g0_d, g1_d, g2_d, g3_d, g4_d, g5_d, g6_d, g7_d, g8_d, gtemp0_d, gtemp1_d, gtemp2_d, gtemp3_d, gtemp4_d, gtemp5_d, gtemp6_d, gtemp7_d, gtemp8_d,
is_solid_d, frame_rate);

(The f's and g's are the pdfs for the two components.)
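For reference, one documented CUDA limit that may be relevant here: the total size of all kernel parameters is restricted to 256 bytes on compute capability 1.x devices (4 KB on 2.x and later). On a 64-bit build each device pointer is 8 bytes, so a list of roughly 43 pointer arguments could already exceed the 1.x limit. A failed launch does not stop the program unless you query the error yourself. Below is a minimal, hedged sketch of such a check (CUDA_CHECK is an illustrative helper, not from the original code), using only the standard runtime calls cudaGetLastError and cudaDeviceSynchronize:

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative helper: print file/line and a readable message for any CUDA error.
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
    } while (0)

// After the launch:
// LBM_comp <<< nBlocks, blockSize >>> ( ... );
CUDA_CHECK(cudaGetLastError());       // reports launch errors (bad config, oversized argument list)
CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran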

Following is my CUDA memory allocation for all the arguments:

cudaMalloc((void **) &rho0_d, size);
cudaMalloc((void **) &ux0_d, size);
cudaMalloc((void **) &uy0_d, size);
cudaMalloc((void **) &is_solid_d, size);
cudaMalloc((void **) &f0_d, size); cudaMalloc((void **) &ftemp0_d, size);
cudaMalloc((void **) &f1_d, size); cudaMalloc((void **) &ftemp1_d, size);
cudaMalloc((void **) &f2_d, size); cudaMalloc((void **) &ftemp2_d, size);
cudaMalloc((void **) &f3_d, size); cudaMalloc((void **) &ftemp3_d, size);
cudaMalloc((void **) &f4_d, size); cudaMalloc((void **) &ftemp4_d, size);
cudaMalloc((void **) &f5_d, size); cudaMalloc((void **) &ftemp5_d, size);
cudaMalloc((void **) &f6_d, size); cudaMalloc((void **) &ftemp6_d, size);
cudaMalloc((void **) &f7_d, size); cudaMalloc((void **) &ftemp7_d, size);
cudaMalloc((void **) &f8_d, size); cudaMalloc((void **) &ftemp8_d, size);

cudaMalloc((void **) &rho1_d, size);
cudaMalloc((void **) &ux1_d, size);
cudaMalloc((void **) &uy1_d, size);
cudaMalloc((void **) &g0_d, size); cudaMalloc((void **) &gtemp0_d, size);
cudaMalloc((void **) &g1_d, size); cudaMalloc((void **) &gtemp1_d, size);
cudaMalloc((void **) &g2_d, size); cudaMalloc((void **) &gtemp2_d, size);
cudaMalloc((void **) &g3_d, size); cudaMalloc((void **) &gtemp3_d, size);
cudaMalloc((void **) &g4_d, size); cudaMalloc((void **) &gtemp4_d, size);
cudaMalloc((void **) &g5_d, size); cudaMalloc((void **) &gtemp5_d, size);
cudaMalloc((void **) &g6_d, size); cudaMalloc((void **) &gtemp6_d, size);
cudaMalloc((void **) &g7_d, size); cudaMalloc((void **) &gtemp7_d, size);
cudaMalloc((void **) &g8_d, size); cudaMalloc((void **) &gtemp8_d, size);

Hi Shadab,

I’m not sure whether there is a limit on the number of arguments you can pass to a kernel in CUDA. However, you might be able to reduce the number of arguments by using one linear array for all of your pdfs.

For example, this would look like:
cudaMalloc((void **) &f_d, size * 9 );

or, for the velocity array:
cudaMalloc((void **) &u0_d, size * 2 );

Then you can access the desired particle distribution via
f_d[ n * N + cellindex ]
where n is the index of the distribution you want to access ( n in [0…8] ). Note that the stride is N, the number of cells; size in the code above is a byte count (N*sizeof(float)), so it must not be used as an element offset.

Analogously, the access to the velocity components would look like:
u_x = u0_d[ 0 * N + cellindex ]
u_y = u0_d[ 1 * N + cellindex ]
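
To make that concrete, here is a minimal sketch of the flattened layout inside a kernel (LBM_comp_flat and idx are illustrative names, not from the original code; N is the number of cells):

// Host side: one allocation for all nine pdfs of a component.
// cudaMalloc((void **) &f_d, 9 * N * sizeof(float));

__global__ void LBM_comp_flat(const float *f_d, const float *u0_d, float *rho_d, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per lattice cell
    if (idx >= N) return;

    // The nine pdfs of this cell sit at stride N (elements, not bytes).
    float rho = 0.0f;
    for (int n = 0; n < 9; ++n)
        rho += f_d[n * N + idx];
    rho_d[idx] = rho;                                  // density = sum of pdfs

    // The velocity components share the same flattened layout.
    float u_x = u0_d[0 * N + idx];
    float u_y = u0_d[1 * N + idx];
    // ... collision and streaming would use u_x, u_y here ...
}

This also shrinks the kernel's argument list from dozens of pointers down to a handful.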

I hope this helps,
Thomas

Hi Shadab,

If you need OpenCL code based on this paper, I can share it with you.

Alex

Hi guys,

Any chance I could get a copy of the OpenCL code mentioned? I played with the LBM using CUDA and got some encouraging results; I'd be interested to see how the OpenCL implementation compares with the CUDA one.

Cheers,
Bruce

Hi there,

just like Bruce, I'm also interested in the OpenCL implementation. If there is a chance to get it, I'd be very pleased.

Regards,
Thomas.

Hi guys,

Write to me at shurik.kuzmin@gmail.com and I will send it to you.

Alex