Argument list in CUDA kernel?

Dear LB community,

I am writing my first LBM code using CUDA on Win7 x64 with MS-VS 2008.

Thanks to Dr. Tolke for a well-written article on using CUDA kernels for the LBM.

Is there any limit on the number of arguments I can pass to a CUDA kernel? I am trying to pass 43 arguments, each of size N*sizeof(float), where N is 6561.

cuda_kernel<<<nblocks, block_size>>>(arg1, arg2, arg3, arg4, …, argn);

My code finishes successfully, but at the end the following error messages appear in the debug output window at the bottom of MS-VS 2008, and the results are not good.

First-chance exception at 0x000007fefd68aa7d in test.exe: Microsoft C++ exception: cudaError_enum at memory location 0x002bdc08…
First-chance exception at 0x000007fefd68aa7d in test.exe: Microsoft C++ exception: cudaError_enum at memory location 0x002bdc08…

The code works fine when I pass 40 arguments of size N*sizeof(float), with N = 6561.

I would appreciate any input to help me understand and, hopefully, solve this problem.

Thanks
Shadab

Following is my CUDA kernel call:
int frame_rate, N=81*81;

size_t size = N*sizeof(float);

int blockSize = 512;
int nBlocks = N/blockSize + (N%blockSize == 0?0:1);

LBM_comp <<< nBlocks, blockSize >>> (rho0_d, ux0_d, uy0_d, rho1_d, ux1_d, uy1_d, N,
f0_d, f1_d, f2_d, f3_d, f4_d, f5_d, f6_d, f7_d, f8_d, ftemp0_d, ftemp1_d, ftemp2_d, ftemp3_d, ftemp4_d, ftemp5_d, ftemp6_d, ftemp7_d, ftemp8_d,
g0_d, g1_d, g2_d, g3_d, g4_d, g5_d, g6_d, g7_d, g8_d, gtemp0_d, gtemp1_d, gtemp2_d, gtemp3_d, gtemp4_d, gtemp5_d, gtemp6_d, gtemp7_d, gtemp8_d,
is_solid_d, frame_rate);

(The f's and g's are the pdfs for the two components.)
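For reference, one documented CUDA limit that may be relevant here: the total size of all kernel parameters is restricted to 256 bytes on compute capability 1.x devices (4 KB on 2.x and later). On a 64-bit build each device pointer is 8 bytes, so a list of roughly 43 pointer arguments could already exceed the 1.x limit. A failed launch does not stop the program unless you query the error yourself. Below is a minimal, hedged sketch of such a check (CUDA_CHECK is an illustrative helper, not from the original code), using only the standard runtime calls cudaGetLastError and cudaDeviceSynchronize:

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative helper: print file/line and a readable message for any CUDA error.
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
    } while (0)

// After the launch:
// LBM_comp <<< nBlocks, blockSize >>> ( ... );
CUDA_CHECK(cudaGetLastError());       // reports launch errors (bad config, oversized argument list)
CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran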

Following is my CUDA memory allocation for all the arguments:

cudaMalloc((void **) &rho0_d, size);
cudaMalloc((void **) &ux0_d, size);
cudaMalloc((void **) &uy0_d, size);
cudaMalloc((void **) &is_solid_d, size);
cudaMalloc((void **) &f0_d, size); cudaMalloc((void **) &ftemp0_d, size);
cudaMalloc((void **) &f1_d, size); cudaMalloc((void **) &ftemp1_d, size);
cudaMalloc((void **) &f2_d, size); cudaMalloc((void **) &ftemp2_d, size);
cudaMalloc((void **) &f3_d, size); cudaMalloc((void **) &ftemp3_d, size);
cudaMalloc((void **) &f4_d, size); cudaMalloc((void **) &ftemp4_d, size);
cudaMalloc((void **) &f5_d, size); cudaMalloc((void **) &ftemp5_d, size);
cudaMalloc((void **) &f6_d, size); cudaMalloc((void **) &ftemp6_d, size);
cudaMalloc((void **) &f7_d, size); cudaMalloc((void **) &ftemp7_d, size);
cudaMalloc((void **) &f8_d, size); cudaMalloc((void **) &ftemp8_d, size);

cudaMalloc((void **) &rho1_d, size);
cudaMalloc((void **) &ux1_d, size);
cudaMalloc((void **) &uy1_d, size);
cudaMalloc((void **) &g0_d, size); cudaMalloc((void **) &gtemp0_d, size);
cudaMalloc((void **) &g1_d, size); cudaMalloc((void **) &gtemp1_d, size);
cudaMalloc((void **) &g2_d, size); cudaMalloc((void **) &gtemp2_d, size);
cudaMalloc((void **) &g3_d, size); cudaMalloc((void **) &gtemp3_d, size);
cudaMalloc((void **) &g4_d, size); cudaMalloc((void **) &gtemp4_d, size);
cudaMalloc((void **) &g5_d, size); cudaMalloc((void **) &gtemp5_d, size);
cudaMalloc((void **) &g6_d, size); cudaMalloc((void **) &gtemp6_d, size);
cudaMalloc((void **) &g7_d, size); cudaMalloc((void **) &gtemp7_d, size);
cudaMalloc((void **) &g8_d, size); cudaMalloc((void **) &gtemp8_d, size);

Hi Shadab,

I’m not sure whether there is a limit on the number of arguments you can pass to a kernel in CUDA. However, you might be able to reduce the number of arguments by using one linear array for all of your pdfs.

For example, this would look like:
cudaMalloc((void **) &f_d, size * 9 );

or, for the velocity array:
cudaMalloc((void **) &u0_d, size * 2 );

Then you can access the desired particle distribution via
f_d[ n * N + cellindex ]
where n is the index of the distribution you want to access ( n in [0…8] ). Note that the stride is N, the number of cells; size in the code above is a byte count (N*sizeof(float)), so it must not be used as an element offset.

Analogously, the access to the velocity components would look like:
u_x = u0_d[ 0 * N + cellindex ]
u_y = u0_d[ 1 * N + cellindex ]
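
To make that concrete, here is a minimal sketch of the flattened layout inside a kernel (LBM_comp_flat and idx are illustrative names, not from the original code; N is the number of cells):

// Host side: one allocation for all nine pdfs of a component.
// cudaMalloc((void **) &f_d, 9 * N * sizeof(float));

__global__ void LBM_comp_flat(const float *f_d, const float *u0_d, float *rho_d, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per lattice cell
    if (idx >= N) return;

    // The nine pdfs of this cell sit at stride N (elements, not bytes).
    float rho = 0.0f;
    for (int n = 0; n < 9; ++n)
        rho += f_d[n * N + idx];
    rho_d[idx] = rho;                                  // density = sum of pdfs

    // The velocity components share the same flattened layout.
    float u_x = u0_d[0 * N + idx];
    float u_y = u0_d[1 * N + idx];
    // ... collision and streaming would use u_x, u_y here ...
}

This also shrinks the kernel's argument list from dozens of pointers down to a handful.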

I hope this helps,
Thomas

Hi Shadab,

If you need OpenCL code based on this paper, I can share it with you.

Alex

Hi guys,

Any chance I could get a copy of the OpenCL code mentioned? I played with the LBM using CUDA and got some encouraging results; I'd be interested to see how the OpenCL implementation compares with the CUDA one.

Cheers,
Bruce

Hi there,

just like Bruce, I'm also interested in the OpenCL implementation. If there is a chance to get it, I'd be very pleased.

Regards,
Thomas.

Hi guys,

Write to me at shurik.kuzmin@gmail.com and I will send it to you.

Alex