[AMBER] MMPBSA.py MPI problem on TACC Stampede

From: Sorensen, Jesper <jesorensen.ucsd.edu>
Date: Tue, 1 Oct 2013 00:55:49 +0000

Hello all,

I've been running MMPBSA.py jobs on the XSEDE resource TACC Stampede. The MPI implementation works perfectly up to 64 cores (4 nodes), but when I move to 5 nodes (80 cores) I get the MPI error shown below. I realize you are not responsible for the TACC resources, but the admins seemed puzzled by the errors and didn't know how to fix the issue, so I am hoping you have some suggestions.
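Since the failure happens inside PMPI_Init_thread, i.e. before any of the MMPBSA.py code actually runs, I would expect a bare mpi4py "hello world" launched on 5 nodes to reproduce it. This is the minimal test I have in mind (a sketch; the file name is my own):

    # mpi_hello.py - minimal MPI startup test, independent of MMPBSA.py.
    # Importing mpi4py.MPI is what triggers the MPI_Init_thread call that
    # fails in the traceback below.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    print("rank %d of %d on %s"
          % (comm.Get_rank(), comm.Get_size(), MPI.Get_processor_name()))

Launched with ibrun (e.g. "ibrun python mpi_hello.py") at 80 tasks, this should die with the same PMPI_Init_thread error if the problem is in the MPI/mpi4py layer rather than in MMPBSA.py itself.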

Amber was compiled using the following modules:
intel/13.1.1.163
mvapich2/1.9a2
python/2.7.3-epd-7.3.2
mpi4py/1.3

The Amber (+AmberTools) installation was last updated on August 13, 2013, and includes all bug fixes released up to that date.
I made sure that there are more frames than cores, so that is not the issue.
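To rule out a mismatched stack (e.g. mpi4py picking up a different MPI than the one Amber was built with), a quick check along these lines can be run on a single node (a sketch; I believe mpi4py 1.3 exposes its build configuration this way):

    # check_stack.py - print which MPI library mpi4py was built against
    import mpi4py
    print(mpi4py.get_config())    # compiler wrappers recorded at build time

    from mpi4py import MPI        # MPI_Init_thread happens on import
    print(MPI.get_vendor())       # e.g. ('MVAPICH2', (1, 9, 0))
    print(MPI.Get_version())      # MPI standard version supported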

The output from the job looks like this:
TACC: Starting up job 1829723
TACC: Setting up parallel environment for MVAPICH2+mpispawn.
TACC: Starting parallel tasks...
[cli_23]: aborting job:
Fatal error in PMPI_Init_thread:
Other MPI error, error stack:
MPIR_Init_thread(436)...:
MPID_Init(371)..........: channel initialization failed
MPIDI_CH3_Init(285).....:
MPIDI_CH3I_CM_Init(1106): Error initializing MVAPICH2 ptmalloc2 library
....
[c464-404.stampede.tacc.utexas.edu:mpispawn_1][child_handler] MPI process (rank: 19, pid: 119854) exited with status 1
...
[c437-002.stampede.tacc.utexas.edu:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 12. MPI process died?
[c437-002.stampede.tacc.utexas.edu:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[c464-404.stampede.tacc.utexas.edu:mpispawn_1][child_handler] MPI process (rank: 17, pid: 119852) exited with status 1
...
TACC: MPI job exited with code: 1
TACC: Shutdown complete. Exiting.
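One possibly relevant detail I came across in the MVAPICH2 documentation: the default value of MV2_ON_DEMAND_THRESHOLD is 64, which matches exactly the core count where my jobs stop working. Above that threshold MVAPICH2 switches to on-demand connection management, which, as far as I understand, depends on the ptmalloc2 hooks that the error says failed to initialize (Python loading the MPI library dynamically is a known way to break ptmalloc initialization). This is only a guess, but would it be worth testing with the threshold raised above the job size, e.g. "export MV2_ON_DEMAND_THRESHOLD=1024" in the job script? The equivalent test from Python would look like this (a sketch; the file name and the value 1024 are my own, and the variable has to be set before MPI is initialized):

    # threshold_test.py - hedged sketch: MVAPICH2 reads MV2_* variables at
    # MPI_Init time, so setting one before mpi4py is imported should behave
    # like exporting it in the job script.
    import os
    # Default is 64; any value above the total rank count should keep the
    # >64-core runs on the same connection-setup path as the working runs.
    os.environ.setdefault("MV2_ON_DEMAND_THRESHOLD", "1024")

    from mpi4py import MPI   # MPI_Init_thread runs here
    print("rank %d up" % MPI.COMM_WORLD.Get_rank())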


Best regards,
Jesper

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Sep 30 2013 - 18:00:04 PDT