[Parallel] How to "get" parallel information more efficiently?

Dear Sirs,

I have two questions.

  1. When I was running the cavity2d example in OpenLB 1.5, I found that the case with 4 CPUs was slower than the serial case on our machine. I have checked that their results are identical. Is this normal? If so, would you mind explaining the reasons to me? Thanks.

  2. In my own simulation case, there are some nodes that do not coincide with the lattice grid, and the information at these nodes is interpolated from the surrounding lattice nodes. In the implementation, I use a lot of “get” functions to obtain the surrounding information, but this is really inefficient. I think it may be because MPI_Bcast is called inside the “get” function. Is there a more efficient way to solve this problem? Thanks.

Regards, Lipen

  1. If you run the example as it is, it probably contains a lot of I/O (production of GIF images). This slows down parallel execution substantially. Comment out all the I/O to get good speed-ups. On a good network, the speed-up of this program should be essentially 4 on 4 processors. Be sure to measure the time inside the program, after the parallel machine has been started (or otherwise, measure over a large number of iterations so that the initialization time of the parallel machine is negligible). A minimal timing sketch is given after this list.

  2. The “get” function should be used only sporadically, and not to implement the algorithmically relevant parts, as it is slow in parallel. Instead, define a new postprocessor to implement non-local dynamics efficiently. For an example of postprocessors, have a look at the implementation of the boundary conditions; a rough structural sketch is also given below.
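
To illustrate point 1, here is a minimal timing sketch in plain MPI (not OpenLB-specific code). The commented-out collideAndStream() call stands in for your actual time step, and runBenchmark/numSteps are just placeholder names:

###############################################################################

#include <mpi.h>
#include <iostream>

void runBenchmark(int numSteps) {
    MPI_Barrier(MPI_COMM_WORLD);          // synchronize before timing
    double tStart = MPI_Wtime();

    for (int iT = 0; iT < numSteps; ++iT) {
        // lattice.collideAndStream();    // your time step; no I/O in here
    }

    MPI_Barrier(MPI_COMM_WORLD);          // wait for all ranks to finish
    double tEnd = MPI_Wtime();

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        std::cout << "Elapsed time for " << numSteps << " steps: "
                  << (tEnd - tStart) << " s" << std::endl;
    }
}

###############################################################################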
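
To illustrate point 2, here is only a rough structural sketch of a custom postprocessor. The base class name PostProcessor3D, the process(BlockLattice3D&) signature, and the block-lattice type are assumptions modeled on how the boundary-condition postprocessors are organized, and further required members of the base class are omitted here, so check the headers of your OpenLB version for the exact interface:

###############################################################################

// Structural sketch only -- class and method names are assumptions.
template<typename T, template<typename U> class Lattice>
class MyNonLocalPostProcessor3D : public PostProcessor3D<T,Lattice> {
public:
    MyNonLocalPostProcessor3D(int x0, int x1, int y0, int y1, int z0, int z1)
        : x0_(x0), x1_(x1), y0_(y0), y1_(y1), z0_(z0), z1_(z1) { }

    // Applied locally on each block; no global communication is needed,
    // which is what makes this approach efficient in parallel.
    virtual void process(BlockLattice3D<T,Lattice>& blockLattice) {
        for (int iX = x0_; iX <= x1_; ++iX) {
            for (int iY = y0_; iY <= y1_; ++iY) {
                for (int iZ = z0_; iZ <= z1_; ++iZ) {
                    // Read neighboring cells, e.g. blockLattice.get(iX+1,iY,iZ),
                    // and apply the non-local rule to the local cell here.
                }
            }
        }
    }
    // (Other members required by the base class are omitted in this sketch.)
private:
    int x0_, x1_, y0_, y1_, z0_, z1_;
};

###############################################################################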

I found that when I use the get function under PARALLEL_MODE_MPI, I always get zero; however, with the same code I get a non-zero value when PARALLEL_MODE_MPI is off.

Most of my program is implemented by following the cavity2d and cavity3d examples, except that I read some information from the velocity field before calling collideAndStream. The relevant part of the program is shown below:

###############################################################################

TensorFieldBase3D<T,3> const& velField = lattice.getDataAnalysis().getVelocity();

// pt.x, pt.y, pt.z give the off-lattice position, of type T (double).
// Indices of the lattice node "below" the point in each direction ...
int id_nx = (int)floor(pt.x);
int id_ny = (int)floor(pt.y);
int id_nz = (int)floor(pt.z);

// ... and of the node "above" it.
int id_px = id_nx + 1;
int id_py = id_ny + 1;
int id_pz = id_nz + 1;

// Trilinear interpolation weights: w_p* is the fractional distance of the
// point from the lower node (it weights the upper node), and w_n* = 1 - w_p*
// weights the lower node.
T w_px = pt.x - (double)(id_nx);
T w_py = pt.y - (double)(id_ny);
T w_pz = pt.z - (double)(id_nz);

T w_nx = 1.0 - w_px;
T w_ny = 1.0 - w_py;
T w_nz = 1.0 - w_pz;

// Interpolated velocity: weighted sum over the eight surrounding nodes.
T u[3];
for (int dim = 0; dim < 3; dim++)
{
    u[dim]  = velField.get(id_nx, id_ny, id_nz)[dim] * w_nx * w_ny * w_nz;
    u[dim] += velField.get(id_nx, id_ny, id_pz)[dim] * w_nx * w_ny * w_pz;
    u[dim] += velField.get(id_nx, id_py, id_nz)[dim] * w_nx * w_py * w_nz;
    u[dim] += velField.get(id_nx, id_py, id_pz)[dim] * w_nx * w_py * w_pz;
    u[dim] += velField.get(id_px, id_ny, id_nz)[dim] * w_px * w_ny * w_nz;
    u[dim] += velField.get(id_px, id_ny, id_pz)[dim] * w_px * w_ny * w_pz;
    u[dim] += velField.get(id_px, id_py, id_nz)[dim] * w_px * w_py * w_nz;
    u[dim] += velField.get(id_px, id_py, id_pz)[dim] * w_px * w_py * w_pz;
}

###############################################################################
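
Would an approach along the following lines be more reasonable? This is only a sketch in plain MPI, not OpenLB code: each rank adds the weighted contributions of the corner nodes it holds locally, and a single MPI_Allreduce combines the partial sums, instead of one broadcast per get() call. The owns and localVel callbacks are hypothetical placeholders for access to the rank-local sub-lattice; ownership of a node would have to be unique per rank so that overlap layers are not counted twice.

###############################################################################

#include <mpi.h>
#include <functional>

void interpolateVelocity(
    const int idx[8][3],                      // the eight corner nodes
    const double w[8],                        // their trilinear weights
    std::function<bool(int,int,int)> owns,    // hypothetical: node on this rank?
    std::function<double(int,int,int,int)> localVel, // hypothetical: local velocity component
    double u[3])                              // interpolated result, same on all ranks
{
    double uLocal[3] = {0.0, 0.0, 0.0};
    for (int corner = 0; corner < 8; ++corner) {
        if (owns(idx[corner][0], idx[corner][1], idx[corner][2])) {
            for (int dim = 0; dim < 3; ++dim) {
                uLocal[dim] += w[corner]
                    * localVel(idx[corner][0], idx[corner][1], idx[corner][2], dim);
            }
        }
    }
    // One collective call for all eight corners and all three components.
    MPI_Allreduce(uLocal, u, 3, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}

###############################################################################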

The compiler I use is PGI (linux86-64/7.0-5).

Thanks.