Multi-processor to reduce the computational time // Parallel running

Hello everybody,
In order to reduce the computational time, I ran my model on multiple processors. However, I have a few questions and hope that someone can help me.
1/ How do I pick the right number of processors to optimize the running time? Is it true that the more processors I use, the faster the simulation runs?
2/ I read in the Palabos guide that "Note that the execution time of the example program is dominated by output operations. To observe a significant speed improvement in the parallel program, the output operations need first to be commented in the source code." I am not sure what this means.
3/ How do I control the number of processors in each direction (x, y, z)? For example, I used the command mpirun -np 8; how does LBM/Palabos distribute the processes? If I want 1 processor in x, 4 in y and 2 in z, can I control that?
Thank you

Hi Nina,

I’ll try to answer these with what I’ve observed and learned, but if anyone sees a mistake in what I say, please feel free to correct me!

  1. This is more of a trial-and-error thing; there’s no specific rule, AFAIK. What you’ll observe is that you get faster speeds with more processors up to a certain point, after which the benefit diminishes. The way I judge this is to print the elapsed time at each output; this gives you an idea of how much you’re gaining for different numbers of cores (see the sketch after this list). In my case I’ve seen good scaling all the way up to 36 cores, and I suspect the improvement would continue for a while longer if I had the resources.

  2. What this means is that every time you produce output, i.e. save an image, a checkpoint, etc., the simulation has to wait until that output is finished before it can continue. So if you ran a simulation for 10 time steps and wrote output at every step, it would take far longer to finish than if you just wrote one output at the end (the sketch after this list shows how to space the outputs out).

  3. (This is the one I’m least sure about.) The way Palabos seems to distribute work over cores is by splitting your domain into multiple smaller sub-domains, which are sent out to the processors. You can see this if you look at the PLB files of a checkpoint. As for whether you can distribute them manually, I’m not sure; however, if you really want to specify the decomposition and you are good at C++, I’m sure there’s a way.
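
As a rough illustration of points 1 and 2, here is a minimal sketch of the kind of loop I mean: it writes results only every outputInterval iterations and prints the wall-clock time per block of iterations, so you can compare runs with different core counts. The names runWithSparseOutput and writeOutput are made up for this example, and the timing uses plain std::chrono rather than anything Palabos-specific.

    #include <chrono>
    #include "palabos3D.h"
    #include "palabos3D.hh"

    using namespace plb;

    // Run the main loop, writing results only every "outputInterval"
    // iterations and reporting the wall-clock time per block of iterations.
    // "writeOutput" stands for whatever GIF/VTK/checkpoint routine your
    // program already has; it is passed in as a callable.
    template <typename Lattice, typename OutputFn>
    void runWithSparseOutput(Lattice &lattice, plint maxIter,
                             plint outputInterval, OutputFn writeOutput)
    {
        auto t0 = std::chrono::steady_clock::now();
        for (plint iT = 1; iT <= maxIter; ++iT) {
            lattice.collideAndStream();   // the actual (parallel) LB work

            if (iT % outputInterval == 0) {
                // Output is a synchronization and disk-I/O point: the less
                // often it is reached, the better the parallel speed-up.
                auto t1 = std::chrono::steady_clock::now();
                pcout << outputInterval << " iterations took "
                      << std::chrono::duration<double>(t1 - t0).count()
                      << " s on " << global::mpi().getSize() << " processes"
                      << std::endl;
                writeOutput(iT);                        // your own output code
                t0 = std::chrono::steady_clock::now();  // exclude I/O from timing
            }
        }
    }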

As an example for point 3, here is how my model gets split over 8 cores; each component id is a processor, and the numbers are the bounds of the sub-domain it owns:

        <Component id="0"> 0 101 0 270 0 88 </Component>
        <Component id="1"> 0 101 0 270 89 177 </Component>
        <Component id="2"> 0 101 271 541 0 88 </Component>
        <Component id="3"> 0 101 271 541 89 177 </Component>
        <Component id="4"> 102 202 0 270 0 88 </Component>
        <Component id="5"> 102 202 0 270 89 177 </Component>
        <Component id="6"> 102 202 271 541 0 88 </Component>
        <Component id="7"> 102 202 271 541 89 177 </Component>

Hope it helps!

[quote="Catsgomeow"]
This is more of a trial-and-error thing; there’s no specific rule, AFAIK. What you’ll observe is that you get faster speeds with more processors up to a certain point, after which the benefit diminishes.
[/quote]

As @Catsgomeow says, there is no definite way to know precisely how many processors/cores to use; it depends on your simulation setup. A good rule of thumb is not to have blocks smaller than 16^3 lattice cells.

[quote="Catsgomeow"]
The way Palabos seems to distribute work over cores is by splitting your domain into multiple smaller sub-domains, which are sent out to the processors. [...] As for whether you can distribute them manually, I’m not sure.
[/quote]

That’s correct. Actually, the splitting and the distribution to the cores can be completely decoupled and performed “by hand”. By default the simulation domain is cut into N blocks (N being the number of cores you have) of similar size that are as “cubic” as possible. For an example of manual block decomposition, have a look at examples/codesByTopic/multiBlock; a rough sketch of the idea follows below.
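
To give a rough idea of what the manual route looks like, here is a hedged sketch following the pattern of that example and of the user guide: you build a SparseBlockStructure3D by adding one Box3D per block by hand, then hand it to the lattice through a MultiBlockManagement3D. The block coordinates below are made up, and the exact constructor signatures should be double-checked against examples/codesByTopic/multiBlock.

    // Sketch of a manual decomposition: hand-chosen blocks (here a 1x2x1
    // split along y). Check examples/codesByTopic/multiBlock for the exact
    // signatures; T, DESCRIPTOR, nx, ny, nz and omega come from the
    // surrounding program.
    plint envelopeWidth = 1;
    SparseBlockStructure3D sparseBlock(Box3D(0, nx-1, 0, ny-1, 0, nz-1));
    sparseBlock.addBlock(Box3D(0, nx-1, 0, ny/2-1,  0, nz-1),
                         sparseBlock.nextIncrementalId());
    sparseBlock.addBlock(Box3D(0, nx-1, ny/2, ny-1, 0, nz-1),
                         sparseBlock.nextIncrementalId());

    MultiBlockLattice3D<T,DESCRIPTOR> lattice (
        MultiBlockManagement3D (
            sparseBlock,
            defaultMultiBlockPolicy3D().getThreadAttribution(),
            envelopeWidth ),
        defaultMultiBlockPolicy3D().getBlockCommunicator(),
        defaultMultiBlockPolicy3D().getCombinedStatistics(),
        defaultMultiBlockPolicy3D().getMultiCellAccess<T,DESCRIPTOR>(),
        new BGKdynamics<T,DESCRIPTOR>(omega) );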



Hi @orestis and @Catsgomeow, thank you for answering my questions, I appreciate your help.
I have one more question. Due to the fine mesh size, my output files are really heavy and take a long time to load. The information I want to obtain from the output is pressure and velocity. Are there any ways to reduce the size of the output file? Thank you

Hello,
if you are not interested in the whole domain, you can output only sub-domains. You can find an example in

examples/showCases/generalExternalFlow/generalExternalFlow.cpp

around line 1267.
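
For what it is worth, here is a minimal sketch of the idea, assuming you write VTK output the way the example programs do: pass a Box3D region of interest to the compute functions instead of the whole domain, and write only the fields you need (velocity, plus density, which in lattice units is proportional to the pressure). The bounds of roi and the dx, dt scaling factors are placeholders from your own setup.

    // Write only a sub-domain ("roi") and only the fields of interest,
    // in single precision, to keep the VTK files small.
    Box3D roi(x0, x1, y0, y1, z0, z1);   // your region of interest

    VtkImageOutput3D<T> vtkOut(createFileName("vtk", iT, 6), dx);
    vtkOut.writeData<float>(*computeDensity(lattice, roi), "density", 1.);
    vtkOut.writeData<3,float>(*computeVelocity(lattice, roi), "velocity", dx/dt);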