Hi,
I have a working LBM (lattice Boltzmann method) code written in CUDA for GPUs. The code works fine for node counts (N) below 7,000, but the output is simply NaN when I try to simulate a bigger domain.
I have observed this behavior on both Linux and Windows machines.
I understand that on a GPU that also drives a display, each CUDA kernel call must finish within a fixed time limit imposed by the driver, which I believe may be as short as one display frame, 1/100th of a second.
Is there a way to get around this time limit? How can I simulate a larger domain on the GPU?
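As a side note, I believe one can check whether the driver actually enforces such a run-time limit by querying the device properties. A minimal sketch, assuming the standard CUDA runtime API (cudaGetDeviceProperties and its kernelExecTimeoutEnabled field):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    // kernelExecTimeoutEnabled is non-zero when the display driver
    // enforces a watchdog limit on kernel run time
    printf("Run time limit on kernels: %s\n",
           prop.kernelExecTimeoutEnabled ? "Yes" : "No");
    return 0;
}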
The following is the sequence of kernel calls inside the time loop. I have tried both 1-D and 2-D blocks.
int blockSize = 256;
int nBlocks = N / blockSize + (N % blockSize == 0 ? 0 : 1); // round up so all N nodes are covered
for (int t = 1; t <= frame_rate * num_frame; t++)
{
    // calculation on device:
    collision  <<< nBlocks, blockSize >>> (f0_d, rho0_d, ux0_d, uy0_d, is_solid_d, ns_d, tau0, F_gr_d); // collision step
    streaming  <<< nBlocks, blockSize >>> (f0_d, ftemp0_d);                                             // streaming step
    bcs_fluid  <<< nBlocks, blockSize >>> (f0_d, rho0_in, rho0_out, ux_in);                             // inlet/outlet boundary conditions
    macro_vars <<< nBlocks, blockSize >>> (f0_d, rho0_d, ux0_d, uy0_d);                                 // recompute density and velocity
}
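To narrow down where the NaN comes from, I was thinking of checking the return codes around each launch. A minimal sketch for one of the kernels, assuming the standard error-query calls (cudaGetLastError / cudaDeviceSynchronize):

collision <<< nBlocks, blockSize >>> (f0_d, rho0_d, ux0_d, uy0_d, is_solid_d, ns_d, tau0, F_gr_d);

// catch configuration/launch errors immediately
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("collision launch failed: %s\n", cudaGetErrorString(err));

// catch run-time errors (including a watchdog abort); this forces a sync,
// so I would only leave it in while debugging
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("collision kernel failed: %s\n", cudaGetErrorString(err));

If the same check is repeated after streaming, bcs_fluid, and macro_vars, the first kernel and iteration that fails should point at the real problem (an out-of-bounds index for large N would also show up under cuda-memcheck).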
Do I need to make asynchronous calls for the kernels?
Thanks
Shadab