Computational time for elementary operations

Hi All,

I just read the paper “A high-performance lattice Boltzmann implementation to model flow in porous media” by Chongxun Pan, Jan F. Prins and Cass T. Miller, 2003. They propose a Sparse Lattice Representation (SLR) for porous media simulations. The benefit from a memory point of view is obvious, since you only have to store the distribution functions of the fluid nodes, but I have my doubts concerning the execution speed. Since the neighbouring information of the regular lattice gets lost in the SLR, every node needs to remember its neighbours. So every time I do the streaming step I have to look up the neighbouring node before I can access it. On a regular lattice the indices are computed at runtime and don’t need to be looked up. So my question is: is it faster to look up the neighbouring index which is stored somewhere (as in the SLR), or is it faster to compute the index at runtime inside the loop?
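To make the question more concrete, the two variants I have in mind look roughly like this (a simplified 2D sketch, all names made up by me):

// Regular (matrix-based) streaming: the neighbour index is computed on the fly.
// f and fTmp are full nx*ny arrays of 9 populations, c[iPop] are the D2Q9 velocities.
void streamRegular(double*** f, double*** fTmp, int c[9][2], int nx, int ny) {
    for (int iX = 1; iX < nx - 1; ++iX)
        for (int iY = 1; iY < ny - 1; ++iY)
            for (int iPop = 0; iPop < 9; ++iPop)
                fTmp[iX + c[iPop][0]][iY + c[iPop][1]][iPop] = f[iX][iY][iPop];
}

// Sparse (SLR-like) streaming: only fluid nodes are stored, and every node keeps
// an explicit list of where its neighbours sit in the compacted arrays.
void streamSparse(double** f, double** fTmp, int** neighborIndex, int numFluidNodes) {
    for (int iNode = 0; iNode < numFluidNodes; ++iNode)
        for (int iPop = 0; iPop < 9; ++iPop)
            fTmp[neighborIndex[iNode][iPop]][iPop] = f[iNode][iPop];   // extra indirection
}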

Thanks for taking the time and best regards,

Martin

As always, it is quite difficult to draw clear conclusions from benchmarks, because the real value of a code depends on much more than a single performance number. Just as important are how well it is adapted to the problems you investigate, how user-friendly it is, and how portable it is, to mention just a few points. Therefore, everything I write below should be taken with care.

OpenLB is based on a regular-matrix philosophy, as opposed to the individual link-tracking philosophy of the article you mention. Actually, in the early days of OpenLB back at the University of Geneva, there was another LB code circulating, written by Alexandre Dupuis, which uses individual link-tracking ( http://dx.doi.org/10.1016/S0167-739X(99)00130-2 ). We unfortunately never ran comparison benchmarks, as Alexandre left the group at about the time I arrived, so this remains an open question.

Obviously, the matrix-based formalism is well adapted to modern cache-based processors, and is very efficient in regular geometries. Of course, a link-based formalism becomes interesting in a porous medium with only few fluid nodes, but my guess is that for this advantage to be significant, the porosity must be very low, i.e. only a small fraction of the sites must be fluid. My intuition is that you can easily lose an order of magnitude of efficiency by abandoning the matrix-based formalism, whereas reducing the number of active nodes in a link-based formulation gains you a few factors at best.

I’ve had a quick and not-very-careful look at the article you cited, and here’s how I read Fig. 6 (please tell me if I misread it). The authors benchmark a porous and a non-porous medium confined in a geometry which, at its largest, has a size of 80^3. Let’s focus on this case and compute the number of site updates per second they reach, counting all lattice sites, not only the fluid nodes.

In a non-porous medium they observe 80^3*100/600/1e6 ≈ 0.09 mega site updates per second, no matter which code they use.
In a porous medium, the regular code again reaches 0.09 mega su/s, whereas the link-based one goes up to 0.26 mega su/s. The conclusion is a speed-up of more than 2, which seems to be in favor of the link-based approach.

But as I am running OpenLB on an Intel machine bought in 2003 (which should make it comparable to the machine used in the article), I get a performance of over 1 mega su/s on a regular 80^3 geometry with the D3Q19 lattice. That’s about four times more than their link-based code. A conclusion here could be that the link-based approach is not really fast; it’s just that the authors compare it with a (very) slow matrix-based code. But as I said, one should never draw too many conclusions from a single benchmark value.

Thanks a lot for your answer, Jonas. When looking at the benchmarks in this paper I also thought that the matrix-based code they compare against is very slow. My idea was to program both methods myself and to compare them. The problem I came across now is that my matrix-based code is not even half as fast as OpenLB: 0.48 mega su/s compared to 1.13 mega su/s (serial execution)! My program is very simple and does not use any object-oriented features. Are there any other numerical tricks besides the swap trick, or is it just my bad programming?

Thanks again,

Martin

Hi Martin,
since you are not using object-oriented programming, there is a tool called gprof you can use to “profile” your code and see where it spends most of its time. If I remember well, you just have to add the option -pg when compiling and linking, run your program once, and then call gprof on the executable. It will give you as output the time spent in each of your subroutines, which should help you optimize your code.

There are mainly two basic places where you can lose lots of time.

The first is the way you access memory in your loops (the stride). It is of course much faster to access contiguous blocks of memory than to jump from one point to another. It depends on the language you use, but you should check the ordering of the indices in your loops. Depending on the compiler, you can also gain something by unrolling your loops manually (if your compiler is “smart enough” you will not see a huge benefit, but if it is not that good the gains can be large; to my knowledge there is no way to know without trying…).
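For example, a minimal sketch, assuming the populations sit in one flat array with the population index varying fastest (if your layout is different, the good and the bad ordering are of course different too):

// Assumed layout: f[(iX*ny + iY)*q + iPop], so iPop varies fastest in memory.

// Cache-friendly: the innermost loop walks through contiguous memory.
void goodOrder(double* f, int nx, int ny, int q, double omega) {
    for (int iX = 0; iX < nx; ++iX)
        for (int iY = 0; iY < ny; ++iY)
            for (int iPop = 0; iPop < q; ++iPop)
                f[(iX*ny + iY)*q + iPop] *= omega;   // stands for some per-population work
}

// Same work, but consecutive inner iterations jump by ny*q elements in memory.
void badOrder(double* f, int nx, int ny, int q, double omega) {
    for (int iPop = 0; iPop < q; ++iPop)
        for (int iY = 0; iY < ny; ++iY)
            for (int iX = 0; iX < nx; ++iX)
                f[(iX*ny + iY)*q + iPop] *= omega;
}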

The second is to try to reuse the cache as much as possible, since accessing the cache is much cheaper than accessing main memory.
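One standard technique for this is loop blocking: instead of sweeping the whole domain in one go, you process it tile by tile, so that the data a tile needs more than once stays in cache. A rough sketch (update() and the tile size BLOCK are placeholders of mine; BLOCK has to be tuned to your cache):

#include <algorithm>

void update(int iX, int iY, int iZ);   // whatever per-cell work you do (hypothetical)

// Blocked 3D sweep: same iteration space as the plain triple loop, but cut into
// BLOCK x BLOCK tiles in the two outer dimensions.
void sweepBlocked(int nx, int ny, int nz) {
    const int BLOCK = 8;
    for (int bX = 1; bX < nx - 1; bX += BLOCK) {
        for (int bY = 1; bY < ny - 1; bY += BLOCK) {
            int endX = std::min(bX + BLOCK, nx - 1);
            int endY = std::min(bY + BLOCK, ny - 1);
            for (int iX = bX; iX < endX; ++iX)
                for (int iY = bY; iY < endY; ++iY)
                    for (int iZ = 1; iZ < nz - 1; ++iZ)
                        update(iX, iY, iZ);
        }
    }
}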

I hope it will help you a bit…

Orestis

Indeed, profiling the code is always a good idea. I would say that the value you get, somewhere below half a mega su/s, is typical for normal LB codes without advanced optimizations. Here are a few things you can try in order to achieve better performance (if you are willing to accept that after each optimization, your code gets uglier):

  • Use OpenLB instead (just joking…)
  • Have you implemented the swap trick? If not, do it; it’s not that difficult
  • Write out the value of the equilibrium for every population function, instead of writing a loop over population functions
  • Avoid having an if/then/else condition inside the loop over space. Don’t try to distinguish between bulk cells and boundary cells. Write a loop over bulk cells first, and then a loop over boundary cells.
  • Don’t compute things twice. Realize that the sum to compute rho can be split into partial sums, and the sum to compute rho*u as well. Some of these partial sums are the same, so compute them only once (see the sketch after this list).
  • Other optimizations exist, which are not implemented in OpenLB. A very common one is loop blocking, which improves cache re-use, as suggested by Orestis (Google for “lattice boltzmann” and “loop blocking”).
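To make the partial-sum point a bit more concrete, here is a minimal D2Q9 sketch. It is not taken from OpenLB, and the velocity ordering is an arbitrary choice for this example:

// Assumed velocity ordering (for this example only):
//   f0:(0,0)   f1:(-1,1)  f2:(-1,0)  f3:(-1,-1)  f4:(0,-1)
//   f5:(1,-1)  f6:(1,0)   f7:(1,1)   f8:(0,1)
void computeMacroscopic(const double f[9], double& rho, double& ux, double& uy) {
    double xNeg = f[1] + f[2] + f[3];   // populations moving in -x
    double xPos = f[5] + f[6] + f[7];   // populations moving in +x
    double yNeg = f[3] + f[4] + f[5];   // populations moving in -y
    double yPos = f[7] + f[8] + f[1];   // populations moving in +y

    rho = f[0] + f[4] + f[8] + xNeg + xPos;   // xNeg and xPos are reused here...
    ux  = (xPos - xNeg) / rho;                // ...and here, instead of summing again
    uy  = (yPos - yNeg) / rho;
}

The same spirit applies to the equilibrium: write out the nine terms explicitly, with the weights 4/9, 1/9 and 1/36 hard-coded, instead of looping over the populations and looking up the velocity and weight arrays.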

Thanks Orestis. Just tried gprof - it seems to be a great tool! I program in C++ and use g++ as the compiler. I’ll try compiling the code with other compilers as well to see if it makes any difference.

Thanks again,

Martin

Oh, I just discovered your reply Jonas! I wrote my last response just when you wrote yours! Thanks a lot for the many hints, I’ll see what I can make out of it.

jlatt Wrote:

  • Use OpenLB instead (just joking…)
  • Have you implemented the swap trick? If not, do
    it; it’s not that difficult
  • Write out the value of the equilibrium for every
    population function, instead of writing a loop
    over population functions
  • Avoid having an if/then/else condition inside
    the loop over space. Don’t try to distinguish
    between bulk cells and boundary cells. Write a
    loop over bulk cells first, and then a loop over
    boundary cells.
  • Don’t compute things twice. Realize that the sum
    to compute rho can be split into partial sums. And
    the sum to compute rho*u as well. And some of
    these partial sums are the same. So compute them
    only once.
  • Other optimizations exist, which are not
    implemented in OpenLB. A very common one is loop
    blocking, which improves cache re-use, as
    suggested by Orestis (Google for “lattice
    boltzmann” and “loop blocking”).

I have tried to implement the swap trick. However, I am using the simple bounce-back scheme at the solid nodes, and I am running into a problem: the value I need for the bounce-back has already been swapped to a neighbouring node.

Double-buffer bounce-back version: buffer1.value[i] = buffer2.value[i_Opposite]
However, when the swap trick is used, “i_Opposite” has already been swapped over to the neighbouring node’s “i + half” slot (or something along those lines).

Does the swap trick support bounce back or do I have to use some other technique? Can the swap trick be used with blocking?

  • Atle

I figured it out. I had to do it the other way around in the streaming step:

swap (f[iX][iY][iF], f[nextX][nextY][iF + half])

instead of:

swap (f[iX][iY][iF + half], f[nextX][nextY][iF])

Then there was no need to do anything in the bounceback step, since the value was already there from the swap in collision.
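For reference, in case somebody else runs into this, here is roughly the structure my 2D code has now. It is a simplified sketch rather than my actual code; it assumes c[iF + half] = -c[iF], that the collision at every fluid cell ends with the local swap of opposite populations, and that nextX/nextY point against c[iF] (exactly the convention detail that confused me). Domain edges and periodicity are not handled here, and collide() is just a placeholder.

// Collision sweep: collide, then swap each population with its opposite
// within the same cell (the "local swap").
for (int iX = 0; iX < nx; ++iX) {
    for (int iY = 0; iY < ny; ++iY) {
        if (!isFluid[iX][iY]) continue;            // solid cells are never touched
        collide(f[iX][iY]);
        for (int iF = 1; iF <= half; ++iF) {
            std::swap(f[iX][iY][iF], f[iX][iY][iF + half]);
        }
    }
}

// Streaming sweep: every fluid-fluid link is swapped exactly once. If the
// neighbour is solid, nothing is done: the local swap above has already left
// the bounced-back value in the right slot, which is why no separate
// bounce-back step is needed.
for (int iX = 0; iX < nx; ++iX) {
    for (int iY = 0; iY < ny; ++iY) {
        if (!isFluid[iX][iY]) continue;
        for (int iF = 1; iF <= half; ++iF) {
            int nextX = iX + c[iF + half][0];      // neighbour AGAINST direction iF
            int nextY = iY + c[iF + half][1];
            if (isFluid[nextX][nextY]) {
                std::swap(f[iX][iY][iF], f[nextX][nextY][iF + half]);
            }
        }
    }
}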

  • Atle

Atlemann Wrote:

I figured it out. I had to do it the other way
around in the streaming step:

swap (f[iX][iY][iF], f[nextX][nextY][iF + half])

instead of:

swap (f[iX][iY][iF + half], f[nextX][nextY][iF])

Then there was no need to do anything in the
bounceback step, since the value was already there
from the swap in collision.

  • Atle

Hi!

I think I have some problems with the flow direction…

Although I get the correct answer for the absolute permeability, I found out that the fluid flows in the opposite direction to what seems logical, which does not matter for abs-perm. However, I am working on a two-phase code with imbibition and drainage, in which this “opposite” flow direction messes up the imbibition and drainage flow directions. Might this have something to do with my changing the indices of the swapping in the streaming step? Can’t seem to get my head around this…

E.g. my logical flow direction is the positive lattice direction (positive direction vectors in X), e.g. c[1] = -1 and c[10] = 1 in X, so c[10] should be the positive flow direction. However, when I manipulate the inlet layer of the model, I have to manipulate c[1] to get the fluids to flow into the model, which tells me that the inlet is actually the outlet. If I use a negative force in the body-force term, I get the expected flow direction.

I use a Cartesian grid communicator in MPI to split the model over multiple CPUs. My inlet and outlet ranks are determined as follows:

myRank_ = gridCommunicator_.Get_rank();                      // my rank within the Cartesian communicator
MPI::Compute_dims (nProcessors_, 3, communicatorDims_);      // factorize the processor count into a 3D grid
gridCommunicator_.Shift (0, 1, rankXMinus_, rankXPlus_);     // neighbouring ranks in the X direction
gridCommunicator_.Get_coords (myRank_, 3, rankCoordinates_); // my coordinates in the processor grid

isOnInlet  = (rankCoordinates_[0] == 0);
isOnOutlet = (rankCoordinates_[0] == communicatorDims_[0] - 1);
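(In case it matters: gridCommunicator_ is a Cartesian communicator created beforehand, roughly like this; the non-periodic boundaries are just an assumption of this sketch, and communicatorDims_ has to be filled by Compute_dims before the call.)

bool periods[3] = { false, false, false };   // no periodic wrap-around assumed
gridCommunicator_ = MPI::COMM_WORLD.Create_cart (3, communicatorDims_, periods, true);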

  • Atle