This is probably not a bug, but it is unexpected behavior, so I thought I’d post here. I’m trying to run sinepipe_perm in the $PALABOS/examples/tutorial folder. I compiled it for MPI (i.e., I changed the appropriate flags in the Makefile, as shown below), and it builds without errors. When I submit the job to my cluster and all of the processes land on the same host, everything works. However, when the processes are spread across more than one host, the run fails with the error messages below. I’ve reproduced this behavior on two different clusters (one running Rocks with Grid Engine, the other RHEL 6 with LSF). I’ve also run a simple matrix-multiplication MPI code on both of these clusters (a minimal check along those lines is sketched after the log), and it has no problems across multiple hosts. Any ideas?
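For reference, the Makefile settings I changed look roughly like this (variable names follow the stock Palabos example Makefiles; they may differ slightly between versions):

    # Set MPI-parallel mode on/off (parallelism in cluster-like environment)
    MPIparallel  = true
    # Compiler to use without MPI parallelism
    serialCXX    = g++
    # Compiler to use with MPI parallelism
    parallelCXX  = mpicxx

Everything else in the Makefile was left at its defaults.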
[n014:10985] *** Process received signal ***
[n014:10985] Signal: Segmentation fault (11)
[n014:10985] Signal code: Address not mapped (1)
[n014:10985] Failing at address: 0x211
[n014:10984] *** An error occurred in MPI_Recv
[n014:10984] *** on communicator MPI_COMM_WORLD
[n014:10984] *** MPI_ERR_TRUNCATE: message truncated
[n014:10984] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
[n014:10985] [ 0] /lib64/libpthread.so.0() [0x3a2d60f4a0]
[n014:10985] [ 1] ./permeability(_ZN3plb26BlockLatticeDataTransfer3DIdNS_11descriptors15D3Q19DescriptorEE18receive_regenerateENS_5Box3DERKSt6vectorIcSaIcEER
[n014:10985] [ 2] ./permeability(_ZN3plb26BlockLatticeDataTransfer3DIdNS_11descriptors15D3Q19DescriptorEE7receiveENS_5Box3DERKSt6vectorIcSaIcEENS_5modif6Mod
[n014:10985] [ 3] ./permeability(_ZNK3plb27ParallelBlockCommunicator3D11communicateERNS_24CommunicationStructure3DERKNS_12MultiBlock3DERS3_NS_5modif6ModifTE
[n014:10985] [ 4] ./permeability(_ZNK3plb27ParallelBlockCommunicator3D17duplicateOverlapsERNS_12MultiBlock3DENS_5modif6ModifTE+0x396) [0x5c3f06]
[n014:10985] [ 5] ./permeability(main+0x1c6) [0x4d67c6]
[n014:10985] [ 6] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3a2d21ecdd]
[n014:10985] [ 7] ./permeability() [0x4cdf89]
[n014:10985] *** End of error message ***
[n014:10984] [[545,1],2] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../../../../orte/mc
[n014:10984] [[545,1],2] attempted to send to [[545,1],0]: tag 14
[n014:10984] [[545,1],2] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../../../../orte/mc
[n014:10984] [[545,1],2] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../ompi/runtime/omp
[n014:10982] *** Process received signal ***
[n014:10982] Signal: Segmentation fault (11)
[n014:10982] Signal code: Address not mapped (1)
[n014:10982] Failing at address: 0x211
[n014:10982] [ 0] /lib64/libpthread.so.0() [0x3a2d60f4a0]
[n014:10982] [ 1] ./permeability(_ZN3plb26BlockLatticeDataTransfer3DIdNS_11descriptors15D3Q19DescriptorEE18receive_regenerateENS_5Box3DERKSt6vectorIcSaIcEER
[n014:10982] [ 2] ./permeability(_ZN3plb26BlockLatticeDataTransfer3DIdNS_11descriptors15D3Q19DescriptorEE7receiveENS_5Box3DERKSt6vectorIcSaIcEENS_5modif6Mod
[n014:10982] [ 3] ./permeability(_ZNK3plb27ParallelBlockCommunicator3D11communicateERNS_24CommunicationStructure3DERKNS_12MultiBlock3DERS3_NS_5modif6ModifTE
[n014:10982] [ 4] ./permeability(_ZNK3plb27ParallelBlockCommunicator3D17duplicateOverlapsERNS_12MultiBlock3DENS_5modif6ModifTE+0x396) [0x5c3f06]
[n014:10982] [ 5] ./permeability(main+0x1c6) [0x4d67c6]
[n014:10982] [ 6] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3a2d21ecdd]
[n014:10982] [ 7] ./permeability() [0x4cdf89]
[n014:10982] *** End of error message ***
[n013][[545,1],7][../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (1
[n013][[545,1],7][../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (1
Jan 25 14:13:36 2013 9909 4 8.3 handleTSRegisterTerm(): TS reports task <3> pid <10985> on host<n014> killed or core dumped
--------------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 10980 on
node n014 exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
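In case it helps, here is a minimal cross-host check of the kind that runs fine on both clusters (a sketch along the lines of my matrix-multiplication test, not the exact code): each rank reports its host name, then rank 0 receives one value from every other rank over MPI_COMM_WORLD.

    // Minimal cross-host MPI sanity check.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char* argv[]) {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Report which host each rank is running on.
        char host[MPI_MAX_PROCESSOR_NAME];
        int len = 0;
        MPI_Get_processor_name(host, &len);
        std::printf("rank %d of %d on %s\n", rank, size, host);

        if (rank == 0) {
            // Rank 0 receives one integer from every other rank.
            for (int src = 1; src < size; ++src) {
                int value = -1;
                MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                std::printf("rank 0 received %d from rank %d\n", value, src);
            }
        } else {
            MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

This kind of point-to-point traffic works across hosts on both clusters, which is why I suspect the problem is specific to how the Palabos example communicates rather than the MPI installation itself.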