This is probably not a bug, but it is unexpected behavior, so I thought I’d post here. I’m trying to run sinepipe_perm in the $PALABOS/examples/tutorial folder. I compiled it for MPI (i.e., I changed the appropriate flags in the Makefile, as shown below), and it builds without errors. When I submit the job to my cluster and all of the processes land on the same host, everything works. However, when the processes are spread across more than one host, the run fails with the error messages below. I’ve reproduced this behavior on two different clusters (one running Rocks with Grid Engine, the other RHEL 6 with LSF). I’ve also run a simple matrix-multiplication MPI code on both of these clusters (a minimal check along those lines is sketched after the log), and it has no problems across multiple hosts. Any ideas?
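For reference, the Makefile settings I changed look roughly like this (variable names follow the stock Palabos example Makefiles; they may differ slightly between versions):

    # Set MPI-parallel mode on/off (parallelism in cluster-like environment)
    MPIparallel  = true
    # Compiler to use without MPI parallelism
    serialCXX    = g++
    # Compiler to use with MPI parallelism
    parallelCXX  = mpicxx

Everything else in the Makefile was left at its defaults.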
[n014:10985] *** Process received signal ***
[n014:10985] Signal: Segmentation fault (11)
[n014:10985] Signal code: Address not mapped (1)
[n014:10985] Failing at address: 0x211
[n014:10984] *** An error occurred in MPI_Recv
[n014:10984] *** on communicator MPI_COMM_WORLD
[n014:10984] *** MPI_ERR_TRUNCATE: message truncated
[n014:10984] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
[n014:10985] [ 0] /lib64/libpthread.so.0() [0x3a2d60f4a0]
[n014:10985] [ 1] ./permeability(_ZN3plb26BlockLatticeDataTransfer3DIdNS_11descriptors15D3Q19DescriptorEE18receive_regenerateENS_5Box3DERKSt6vectorIcSaIcEER
[n014:10985] [ 2] ./permeability(_ZN3plb26BlockLatticeDataTransfer3DIdNS_11descriptors15D3Q19DescriptorEE7receiveENS_5Box3DERKSt6vectorIcSaIcEENS_5modif6Mod
[n014:10985] [ 3] ./permeability(_ZNK3plb27ParallelBlockCommunicator3D11communicateERNS_24CommunicationStructure3DERKNS_12MultiBlock3DERS3_NS_5modif6ModifTE
[n014:10985] [ 4] ./permeability(_ZNK3plb27ParallelBlockCommunicator3D17duplicateOverlapsERNS_12MultiBlock3DENS_5modif6ModifTE+0x396) [0x5c3f06]
[n014:10985] [ 5] ./permeability(main+0x1c6) [0x4d67c6]
[n014:10985] [ 6] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3a2d21ecdd]
[n014:10985] [ 7] ./permeability() [0x4cdf89]
[n014:10985] *** End of error message ***
[n014:10984] [[545,1],2] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../../../../orte/mc
[n014:10984] [[545,1],2] attempted to send to [[545,1],0]: tag 14
[n014:10984] [[545,1],2] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../../../../orte/mc
[n014:10984] [[545,1],2] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../ompi/runtime/omp
[n014:10982] *** Process received signal ***
[n014:10982] Signal: Segmentation fault (11)
[n014:10982] Signal code: Address not mapped (1)
[n014:10982] Failing at address: 0x211
[n014:10982] [ 0] /lib64/libpthread.so.0() [0x3a2d60f4a0]
[n014:10982] [ 1] ./permeability(_ZN3plb26BlockLatticeDataTransfer3DIdNS_11descriptors15D3Q19DescriptorEE18receive_regenerateENS_5Box3DERKSt6vectorIcSaIcEER
[n014:10982] [ 2] ./permeability(_ZN3plb26BlockLatticeDataTransfer3DIdNS_11descriptors15D3Q19DescriptorEE7receiveENS_5Box3DERKSt6vectorIcSaIcEENS_5modif6Mod
[n014:10982] [ 3] ./permeability(_ZNK3plb27ParallelBlockCommunicator3D11communicateERNS_24CommunicationStructure3DERKNS_12MultiBlock3DERS3_NS_5modif6ModifTE
[n014:10982] [ 4] ./permeability(_ZNK3plb27ParallelBlockCommunicator3D17duplicateOverlapsERNS_12MultiBlock3DENS_5modif6ModifTE+0x396) [0x5c3f06]
[n014:10982] [ 5] ./permeability(main+0x1c6) [0x4d67c6]
[n014:10982] [ 6] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3a2d21ecdd]
[n014:10982] [ 7] ./permeability() [0x4cdf89]
[n014:10982] *** End of error message ***
[n013][[545,1],7][../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (1
[n013][[545,1],7][../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (1
Jan 25 14:13:36 2013 9909 4 8.3 handleTSRegisterTerm(): TS reports task <3> pid <10985> on host<n014> killed or core dumped
--------------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 10980 on
node n014 exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
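In case it helps, here is a minimal cross-host check of the kind that runs fine on both clusters (a sketch along the lines of my matrix-multiplication test, not the exact code): each rank reports its host name, then rank 0 receives one value from every other rank over MPI_COMM_WORLD.

    // Minimal cross-host MPI sanity check.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char* argv[]) {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Report which host each rank is running on.
        char host[MPI_MAX_PROCESSOR_NAME];
        int len = 0;
        MPI_Get_processor_name(host, &len);
        std::printf("rank %d of %d on %s\n", rank, size, host);

        if (rank == 0) {
            // Rank 0 receives one integer from every other rank.
            for (int src = 1; src < size; ++src) {
                int value = -1;
                MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                std::printf("rank 0 received %d from rank %d\n", value, src);
            }
        } else {
            MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

This kind of point-to-point traffic works across hosts on both clusters, which is why I suspect the problem is specific to how the Palabos example communicates rather than the MPI installation itself.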