Parallel/Concurrent Programming

I am currently at the start of my PhD project and planning to look pretty extensively at the Lattice Boltzmann method. I’m sure everyone here is aware that the LB method lends itself very well to parallel programming; however, I’d be interested to know how everyone here goes about implementing the LB method in a parallel programming language. More specifically, which language/implementation do you use, and why?

I’ve played a bit with CUDA, and eventually I plan to do some work with OpenCL; however, those Nvidia Fermi cards I have my eye on are a bit expensive for this year’s budget, so I’m working with a traditional quad-core machine. In the meantime I need to implement the LB method using a parallel programming language, and it seems I’m spoiled for choice. I’ve played a bit with Google’s Go (www.golang.org) and that seems pretty good, but the language isn’t mature enough for me to be happy using it. I’ve also read that Stackless Python might be good. Speaking to someone earlier, they mentioned that good old Fortran should work well for this. For those of you with experience with these languages, which is the simplest? Which is the most powerful? I’m also interested in hearing any other thoughts you may have on this subject.

If you want to set up efficient simulations, you should use languages like Fortran or C/C++. For parallelization I recommend MPI; using OpenMP restricts your simulations to a small number of processors.
Rule of thumb: the closer the programming language is to the hardware, the faster the simulations will be. In other words: Matlab won’t give you efficient simulations.
Please also consider that GPU-based simulations may be very fast for the plain LBM, but the moment you couple additional physics to your system (FEM, for example), CUDA may lose its advantages.
What exactly do you want to do in your PhD?

Hi there

Although I’m new to LBM, I’d recommend the GPU; you can go with CUDA or OpenCL. I’m using OpenCL because of its transparent scalability: the same code uses 2 cores on my laptop, 8 cores at work, and 128 when running on the GPU, with practically no extra effort. But if your project is going to run on a “cluster”, MPI is the way to go.
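To give you an idea of how little changes between devices, here is roughly what the device selection looks like, sketched with the PyOpenCL Python bindings (untested, just to illustrate the idea):

```python
# Rough sketch of OpenCL's "transparent scalability": the same host code
# and kernels run unchanged; only the device selection below changes.
import pyopencl as cl

for platform in cl.get_platforms():
    for device in platform.get_devices(cl.device_type.ALL):
        print(platform.name, "->", device.name,
              "| compute units:", device.max_compute_units)

# Pick the first GPU if one is present, otherwise fall back to a CPU:
devices = [d for p in cl.get_platforms() for d in p.get_devices()]
gpus = [d for d in devices if d.type & cl.device_type.GPU]
ctx = cl.Context([gpus[0] if gpus else devices[0]])
```

Everything built on top of that context (buffers, programs, kernels) stays the same whether it ends up on the 2 cores or the 128.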

By the way, Timm, my project is also coupling LBM and FEM on the GPU; could you give me any hints about the slowdown you mention?

Thanks in advance

The slowdown I mentioned occurs when the code you use is not entirely parallel any more.
As far as I know, computations on GPUs are very fast as long as the simulations stick to some basic paradigms (regular, coalesced memory access and as little branching as possible, as I understand it). In other words, one cannot efficiently run arbitrary code on a GPU. Please correct me if I am wrong.
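To make those paradigms concrete: in an LBM collision step, every node can be updated independently, so each work-item can touch only its own array element; neighbouring threads then read neighbouring addresses. A rough, untested sketch in Python with the PyOpenCL bindings (array names, sizes, and omega are all made up for illustration):

```python
import numpy as np
import pyopencl as cl

# OpenCL C kernel: one lattice node per work-item, branch-free BGK collision.
src = """
__kernel void bgk_relax(__global float *f,
                        __global const float *feq,
                        const float omega)
{
    int i = get_global_id(0);         // this work-item's own node
    f[i] += omega * (feq[i] - f[i]);  // coalesced reads/writes, no branches
}
"""

ctx = cl.create_some_context()        # picks (or asks for) a device
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, src).build()

n = 1 << 20                           # number of nodes, made up
f = np.random.rand(n).astype(np.float32)
feq = np.random.rand(n).astype(np.float32)

mf = cl.mem_flags
f_d = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=f)
feq_d = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=feq)

prg.bgk_relax(queue, (n,), None, f_d, feq_d, np.float32(1.0))
cl.enqueue_copy(queue, f, f_d)        # copy the result back to the host
```

Code that violates these paradigms (scattered memory access, heavy branching) is, as far as I understand, where the GPU loses its advantage.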

Thanks guys, I think I’m going to look into Fortran purely out of interest. jmazo, could you point me to some good tutorials for using OpenCL? I’ve had a look, but it seems a bit confusing.

We’ve got a few applications in mind for this; predominantly it will be porous media flows, but we also want to do some multicomponent and multiphase stuff.

For a quad-core machine, MPI is the solution. It can go with either Fortran or C; it’s a question of your preference.

If you are still interested in GPU stuff, you can use Python bound to CUDA or OpenCL (PyCUDA, PyOpenCL). These are Python bindings over libraries originally written in C/C++, so basically you get the easy programming of Python with the high performance of C.
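For a taste of what that looks like, here is a minimal PyCUDA sketch (untested, in the style of PyCUDA’s introductory examples; the kernel and sizes are made up):

```python
import numpy as np
import pycuda.autoinit               # creates a CUDA context on the default GPU
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# The kernel itself is still CUDA C, passed in as a string...
mod = SourceModule("""
__global__ void scale(float *dest, float *a, float s)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    dest[i] = s * a[i];
}
""")
scale = mod.get_function("scale")

# ...but all the host-side boilerplate (allocation, copies) is Python.
a = np.random.randn(512).astype(np.float32)
dest = np.zeros_like(a)
scale(drv.Out(dest), drv.In(a), np.float32(2.0),
      block=(128, 1, 1), grid=(4, 1))
```

So the kernels stay in C, but everything around them gets much shorter.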

Hopefully it will help,
Alex

Hi Alex,

Thanks for the great reply. I’ve been told that MPI might be too “heavy-handed” for a simple LB program; can you elaborate on this?

How mature is PyOpenCL? Also, how does it compare to just writing OpenCL kernels directly?

Thanks for these responses, guys; as a somewhat amateur programmer, it’s always helpful to see how everyone else is doing these things.

For a pure LBM code, it is relatively easy to parallelize the code with MPI. Apart from the initialization of MPI (which is not hard), you only need a handful of functions, i.e., you do not have to use all of MPI’s capabilities. On the other hand, I do not know how difficult it is to parallelize the code in such a way that it runs on a GPU.
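To illustrate what I mean by a handful of functions, here is a rough sketch (using the mpi4py Python bindings, to stay consistent with the other snippets in this thread; the slab width and array names are made up) of a 1D domain decomposition in which the only communication per time step is the exchange of ghost layers:

```python
import numpy as np
from mpi4py import MPI               # importing mpi4py also initializes MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

nx_local = 100                       # slab width owned by this rank (made up)
# One distribution component; the two extra cells are the ghost layers.
f = np.zeros(nx_local + 2, dtype=np.float64)

left = (rank - 1) % size             # periodic neighbours, for simplicity
right = (rank + 1) % size

def exchange_ghosts(f):
    # Send my rightmost real cell right, receive my left ghost from the left,
    # and vice versa. Sendrecv avoids deadlocks from unmatched sends/receives.
    comm.Sendrecv(f[-2:-1], dest=right, recvbuf=f[0:1], source=left)
    comm.Sendrecv(f[1:2], dest=left, recvbuf=f[-1:], source=right)

for step in range(1000):
    exchange_ghosts(f)
    # ... streaming and collision on the real cells f[1:-1] go here ...
```

Communicator queries, Sendrecv, and (in C or Fortran) MPI_Init/MPI_Finalize: that is essentially the whole list for a pure LBM code.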

Once the relevant library (CUDA/OpenCL) is well understood, it’s actually no harder to get an LB code running on the GPU than it is on the CPU. The tricky part is that those libraries are not easy to use or understand compared to the parallel programming implementations that already exist for the CPU; a brief look at some Google Go code next to an OpenCL code is enough to demonstrate this.

Hi Bruce,

I agree with Timm that the implementation of a simple flow with MPI is pretty simple. However, if you have a machine with shared memory, it’s even more straightforward to use OpenMP. If you have a distributed system, you need to use MPI.

As for PyOpenCL, I’ve never used it myself, but I’m on the mailing list :) My impression is that it can do basically everything the OpenCL C++ bindings can do. In any case, it’s a binding: as I understand it (I don’t know exactly), you still write the kernel, but you prepare it with Python commands. And Python is just simpler than C++.

I recommend you look at this project: http://gitorious.org/sailfish. It’s an LBM solver in Python; they prepare the kernels with Python commands, so you can see exactly what they do.
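My rough understanding of how that works (I have only skimmed it, so treat this as a guess): the OpenCL kernel is not written by hand but assembled from Python strings at run time, so lattice constants can be baked into the source before it is compiled. A toy version of the idea, assuming PyOpenCL:

```python
import pyopencl as cl

omega = 1.7                          # relaxation parameter, made up
kernel_template = """
__kernel void relax(__global float *f, __global const float *feq)
{
    int i = get_global_id(0);
    f[i] += %(omega)ff * (feq[i] - f[i]);  /* omega baked in before build */
}
"""

src = kernel_template % {"omega": omega}  # generate the kernel source...
ctx = cl.create_some_context()
prg = cl.Program(ctx, src).build()        # ...and compile it at run time
```

I believe sailfish uses a proper templating system rather than plain string formatting, but the principle is the same.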

Sorry I can’t provide a lot of details; I don’t have enough knowledge,
Alex
