[GPU] C++ interface.

I found this link that could be interesting. It is a C++ template library that is an interface to CUDA on GPUs:

A related article using LBM was published in 2008:

“J. Tölke and M. Krafczyk 2008, TeraFLOP computing on a desktop PC with GPUs for 3D CFD”



Since 2008 there has been work on a standard for writing programs that run on heterogeneous CPUs and GPUs: OpenCL. OpenCL has C99 bindings and experimental C++ bindings:


for a tutorial,


I try to post here the relevant links that I find on this topic. I am curious how difficult it would be to write a software layer that lets Xflow take advantage of OpenCL. Unfortunately, I’m not qualified to answer this question myself. That is why I am putting this information here, and I would be grateful if the developers of Xflow would share their opinion on the topic. This is especially important if the TeraFLOP performance in the above article could be achieved on a machine that costs on the order of 4000 dollars using an NVIDIA Tesla,




Hello Maka,

Thank you for keeping us posted; your links, especially the one to the Thrust library, are very interesting. We have been considering adding GPU functionality to xFlows for quite a while now. As a side remark, similarly to the GPU, it seems interesting to consider implementing LB code on the Cell processor, the processor found inside the PlayStation 3.

Simply rewriting xFlows in CUDA/OpenCL/Thrust does not seem to be feasible. Remember that there are around 100’000 lines of code. Just rewriting these lines verbatim, without modification, would keep you busy for several weeks, not to mention the burden of translating into a different language and a different programming approach. Furthermore, GPUs are not general-purpose processors: they work well only for algorithms which can be cast into a SIMD formulation. This is the case for the basic LB algorithm, but not for many of the secondary algorithms which went into the code, such as the algorithm for managing the multi-block structure, or some of the non-local boundary conditions.
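To illustrate why the basic LB algorithm fits a SIMD formulation, here is a minimal sketch of a BGK collision step on a D2Q9 lattice in plain C++. It is not xFlows code: the names `Node`, `collideBGK`, `collideAll` and `density` are made up for this example, and the D2Q9 lattice is an assumption chosen for brevity. The point is that the collision on a node reads and writes only that node's own populations, so every node can be processed by the same instructions independently of all the others.

```cpp
#include <array>
#include <cassert>
#include <vector>

// D2Q9 lattice (assumed for this sketch): 9 discrete velocities and weights.
constexpr int Q = 9;
constexpr std::array<int, Q> cx = {0, 1, 0, -1, 0, 1, -1, -1, 1};
constexpr std::array<int, Q> cy = {0, 0, 1, 0, -1, 1, 1, -1, -1};
constexpr std::array<double, Q> w = {4.0 / 9,  1.0 / 9,  1.0 / 9,
                                     1.0 / 9,  1.0 / 9,  1.0 / 36,
                                     1.0 / 36, 1.0 / 36, 1.0 / 36};

// Populations of a single lattice node (hypothetical type name).
using Node = std::array<double, Q>;

// Density of a node: sum of its populations.
inline double density(const Node& f) {
    double rho = 0.0;
    for (int i = 0; i < Q; ++i) rho += f[i];
    return rho;
}

// BGK collision on one node. Only this node's data is touched,
// which is what makes the step data-parallel.
inline void collideBGK(Node& f, double omega) {
    assert(omega > 0.0 && omega < 2.0);  // usual stability range
    double rho = 0.0, ux = 0.0, uy = 0.0;
    for (int i = 0; i < Q; ++i) {
        rho += f[i];
        ux += cx[i] * f[i];
        uy += cy[i] * f[i];
    }
    ux /= rho;
    uy /= rho;
    const double usq = ux * ux + uy * uy;
    for (int i = 0; i < Q; ++i) {
        const double cu = cx[i] * ux + cy[i] * uy;
        // Second-order equilibrium distribution.
        const double feq =
            w[i] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq);
        f[i] += omega * (feq - f[i]);
    }
}

// Same operation applied to every node; iterations are independent,
// so on a GPU each one could be handled by its own thread.
inline void collideAll(std::vector<Node>& nodes, double omega) {
    for (auto& f : nodes) collideBGK(f, omega);
}
```

The multi-block management and the non-local boundary conditions mentioned above do not have this property, because they need data from several nodes or from bookkeeping structures.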

I realize that the Thrust library offers a workaround. While the main code is executed on a CPU, only a few inherently SIMD algorithms of the STL are executed by the GPU. Although this is conceptually neat, I do have doubts that this approach leads to the TeraFLOP performance mentioned in the paper you are referencing, because there is a price to pay for copying data between CPU and GPU frequently (one would need to try it out to be sure, of course).
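As a rough picture of the STL-style programming model, the sketch below applies an element-wise operation with `std::transform` on the CPU. The functor `Relax` and the helper `relaxAll` are hypothetical names invented for this example. This is plain host code, not Thrust code, and it says nothing about actual GPU performance.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical element-wise operation: relax a value toward a target.
struct Relax {
    double omega;
    double target;
    double operator()(double x) const { return x + omega * (target - x); }
};

// Host (CPU) version: one STL algorithm call over the whole array.
inline std::vector<double> relaxAll(const std::vector<double>& in,
                                    double omega, double target) {
    assert(omega > 0.0 && omega < 2.0);  // same range as a BGK relaxation
    std::vector<double> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(), Relax{omega, target});
    return out;
}
```

With Thrust, the analogous call would be `thrust::transform` over `thrust::device_vector<double>` containers, with the functor's `operator()` marked `__host__ __device__`. Filling the device vector from host data and copying the result back are exactly the CPU-GPU transfers whose cost is in question here.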

What we are doing right now is adapting the structure of xFlows to make it friendlier toward hybrid approaches, similar to the one used in the Thrust library. The idea is to identify homogeneous sub-domains of the simulation which contain only, say, BGK nodes and no boundary condition. The calculations on this domain could then be executed by what people like to call a “computational kernel”: an arbitrary computational unit which does the job fast, for example by using the GPU or a Cell processor. At this point, I guess that it would be reasonably easy to write such a kernel for the GPU, even for a person who is unfamiliar with C++ and with xFlows. However, this requires quite a number of structural changes to xFlows, and a lot of technical work like specifying a communication protocol between xFlows and the kernels. We are working on it, but we won’t be done any time soon.
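The first step of this idea, finding homogeneous sub-domains, can be sketched in a simplified form. The example below scans one row of node tags and returns the contiguous runs of pure bulk nodes that are long enough to be worth handing to a kernel. Everything in it (`NodeType`, `Run`, `findBulkRuns`, the `minLength` threshold, and the restriction to a single 1D row) is an assumption made for illustration and not the actual xFlows data structure.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical node classification: plain BGK bulk node, or anything else
// (boundary condition, special dynamics, ...).
enum class NodeType { Bulk, Boundary };

// Half-open index range [begin, end) of consecutive bulk nodes.
struct Run {
    std::size_t begin;
    std::size_t end;
};

// Find all maximal runs of bulk nodes with at least minLength nodes.
// Shorter runs are left to the general-purpose CPU code path.
inline std::vector<Run> findBulkRuns(const std::vector<NodeType>& tags,
                                     std::size_t minLength) {
    assert(minLength > 0);
    std::vector<Run> runs;
    const std::size_t n = tags.size();
    std::size_t i = 0;
    while (i < n) {
        if (tags[i] != NodeType::Bulk) {
            ++i;
            continue;
        }
        std::size_t j = i;
        while (j < n && tags[j] == NodeType::Bulk) ++j;  // extend the run
        if (j - i >= minLength) runs.push_back(Run{i, j});
        i = j;
    }
    return runs;
}
```

Each returned range could then be passed to a computational kernel, while the remaining nodes stay with the regular code. A real implementation would work on 2D or 3D blocks and would also have to define how data is exchanged with the kernels, which is the communication protocol mentioned above.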

Many thanks Jonas for your explanations. Another link that may be relevant is the PGI OpenMP-like directives to enable GPU programming:


See the additional resources on that page for demos, especially this one,


This could be used as a quick prototype to estimate what benefits could be gained from making certain parts of a program GPU-enabled.


This is another C++ interface that I found,


Two important changes in the scene of GPU:

Best regards,

Hi All,

I have not used the Palabos code yet and am looking at the possibility of using it in my research group. I am wondering whether this code can be used effectively with GPGPU. Has anyone successfully tested it on a GPU before? If yes, what are the typical challenges we might face in the process? What is the learning curve (e.g. difficulty level, time frame, etc.)? This thread was last updated 4 years ago, so I am wondering whether there are any recent updates in this context.

Thank you.


Because Palabos is written in C++ using templates, it can be used with CUDA more effectively. If you want to use OpenCL or OpenACC, it may be more difficult. However, there is a lot of work to do no matter which one you choose.