scaling in CUDA !!

Hi,

I have a working LBM code written in CUDA for GPUs. The code works fine for node counts (N) below 7,000, but the output is simply NaN when I try to simulate a bigger domain.

I have observed this behavior on both Linux- and Windows-based machines.
I understand that each CUDA kernel call should complete faster than the GPU's frame interval, which may be 1/100th of a second.

Is there a way to get around this frame-rate limitation? How can I simulate a larger domain on the GPU?

The following are the kernel calls inside the time loop. I have tried both 1-D and 2-D blocks.

int blockSize = 256;
int nBlocks = N / blockSize + (N % blockSize == 0 ? 0 : 1);   // round up so all N nodes are covered

for (t = 1; t <= frame_rate * num_frame; t++)
{
    // calculation on device:
    collision  <<< nBlocks, blockSize >>> (f0_d, rho0_d, ux0_d, uy0_d, is_solid_d, ns_d, tau0, F_gr_d);
    streaming  <<< nBlocks, blockSize >>> (f0_d, ftemp0_d);
    bcs_fluid  <<< nBlocks, blockSize >>> (f0_d, rho0_in, rho0_out, ux_in);
    macro_vars <<< nBlocks, blockSize >>> (f0_d, rho0_d, ux0_d, uy0_d);
}
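To rule out launch failures at larger N, I also check the CUDA error state at the end of each time step, inside the loop, with something like the snippet below (this is only a debugging sketch, not part of the solver; it needs stdio.h on the host side):

// after the macro_vars launch, at the end of each time step:
cudaError_t err = cudaGetLastError();        // catches invalid launch configurations
if (err == cudaSuccess)
    err = cudaDeviceSynchronize();           // waits for the queued kernels and reports runtime errors
if (err != cudaSuccess)
{
    printf("CUDA error at t = %d: %s\n", t, cudaGetErrorString(err));
    break;
}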

Do I need to make asynchronous calls to the kernels?

Thanks
Shadab

Hi Shadab,

I've actually been following some of your forum posts, both here and in the CUDA forums, to debug some of my own code :slight_smile:

In the last few weeks I have written a simple CUDA code for the 2D cylinder flow problem, and for a long time I was also getting NaN results on large domains. A few things which I found helped were:

1. Completely avoid using shared memory in CUDA kernels. This is counter to all the advice on writing CUDA kernels, but so far I just haven't managed to make it work with shared memory.

2. Pass pointers to structs of arrays to the kernels rather than pointers to arrays (got that from your thread on the NVIDIA forums :slight_smile: ).

3. Use __syncthreads() after any IF statement. Not all IF statements within kernels cause problems, but I have identified a few in my code that can give erroneous results if __syncthreads() is not called (see the sketch just below this list).
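Here is a stripped-down sketch of what I mean by the last point; the kernel and array names are only placeholders, not my actual code:

__global__ void apply_bc_example(float *f, const int *is_solid, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < n && is_solid[idx])
    {
        // branch taken only by threads sitting on solid nodes (e.g. a bounce-back rule)
        f[idx] = -f[idx];
    }

    __syncthreads();   // barrier placed after the IF block, so every thread in the block reaches it
}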

The code I have now works for domains with N of any size, and I would be happy to send it to you. Personally, though, I don't think the code is that good, because it doesn't take advantage of the speed of shared memory.

Also, a note: you don't want to make those kernel calls asynchronous; they should be executed sequentially, as you have it now.

Did you manage to get the streaming function to work as proposed in Dr. Tolke's paper? I was not able to get any decent results with that streaming function despite my best efforts.

I am very interested in reading anything you care to write on this code since I think we are encountering the same problems.

Regards,
Bruce

Hi Bruce,
I am glad that you resolved the problem I have been dealing with.
Thanks for your suggestions. I have been passing pointers to arrays, not pointers to structures or arrays of structures, to the kernels.
I don't understand shared memory well enough to implement it in my code. Hence, I haven't followed the (efficient) method proposed by Dr. Tolke.
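Coming back to the struct idea: just so I understand it correctly, do you mean grouping the device pointers into one struct and handing that to the kernel, roughly like the sketch below? (The names here are only placeholders for my arrays, and the struct is passed by value rather than as a device pointer.)

struct LatticeArrays
{
    float *f;        // distribution functions
    float *rho;      // density
    float *ux, *uy;  // velocity components
};

__global__ void collision_example(LatticeArrays lat, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
    {
        // the kernel reads and writes through the pointers carried by the struct
        lat.ux[idx] = 0.0f;   // placeholder work
    }
}

// host side: fill the struct with cudaMalloc'ed device pointers, then
// collision_example <<< nBlocks, blockSize >>> (lat, N);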

Could you please send me your code at sanwa001@fiu.edu? I will be glad to share my code with you and work together to learn more about CUDA and LBM.

Thanks
Shadab

I’ve just sent the code to you. Has anyone else here managed to get a working implementation of Dr. Tolke’s method?

Hello,

I have an OpenCL implementation for the 2-D shallow-water problem.

I can send it to you if you need it,
Alex

Hi Alex,
Could you please send me your OpenCL-LB code?

sanwa001@fiu.edu

Thanks
Shadab

Hi Alex,

I am trying to implement an efficient LBM simulation in OpenCL for ATI GPUs. Can you please send the code to me too?
My mail ID - priyadarshi@darshan3d.com

Thanks
Priyadarshi

Hi,
Just wanted to post an update that my problem was resolved by Carlos Fernandez at The University of Texas at Austin. He suggested doing the “streaming” step in two separate kernel calls. That solved the problem. Now I can simulate single- or multi-phase flow problems for any domain size.

Shadab

Hi Shadab,

Can you elaborate on this?

Bruce

Hi Bruce,

I use ftemp to save the post-streaming f’s and later copy them back to f after streaming is done for the entire domain.

I have two streaming kernels: streaming() and streamingUpdate().


__global__ void streaming()
{
    // idx is the thread id; i, j, ip, jp should be written in terms of idx
    // ip and jp are the nearest neighbouring nodes in each lattice direction

    ftemp(ip, jp, 1:9) = f(i, j, 1:9)   // pseudocode: push the f's to their neighbours
}


__global__ void streamingUpdate()
{
    f(i, j, 1:9) = ftemp(i, j, 1:9)   // copy the post-streaming values back
}

streaming() and streamingUpdate() are called sequentially from the main program.
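For completeness, here is roughly what the two kernels look like when written out for a D2Q9 lattice stored as a flat array indexed by f[9*(j*nx+i)+k]. The ex/ey ordering, the periodic wrap-around, and the exact indexing below are just one possible layout, not necessarily the same as in my code:

// D2Q9 lattice velocities (one possible ordering)
__constant__ int ex[9] = { 0, 1, 0, -1,  0, 1, -1, -1,  1 };
__constant__ int ey[9] = { 0, 0, 1,  0, -1, 1,  1, -1, -1 };

__global__ void streaming(const float *f, float *ftemp, int nx, int ny)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per lattice node
    if (idx >= nx * ny) return;

    int i = idx % nx;
    int j = idx / nx;

    for (int k = 0; k < 9; k++)
    {
        // nearest neighbour in direction k, with periodic wrap-around at the edges
        int ip = (i + ex[k] + nx) % nx;
        int jp = (j + ey[k] + ny) % ny;
        ftemp[9 * (jp * nx + ip) + k] = f[9 * (j * nx + i) + k];
    }
}

__global__ void streamingUpdate(float *f, const float *ftemp, int nx, int ny)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= nx * ny) return;

    // copy the post-streaming values back once the whole domain has streamed
    for (int k = 0; k < 9; k++)
        f[9 * idx + k] = ftemp[9 * idx + k];
}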