On Mon, Sep 30, 2013 at 8:55 PM, Sorensen, Jesper <jesorensen.ucsd.edu> wrote:
> Hello all,
>
> I've been running MMPBSA.py jobs on the XSEDE resource TACC Stampede. The
> MPI implementation works perfectly up to 64 cores (4 nodes), but when I
> move to 5 nodes I get the MPI error below. I realize you are not
> responsible for the TACC resources, but the admins seemed puzzled by the
> errors and did not know how to fix the issue, so I am hoping you have
> some suggestions.
>
> Amber was compiled using the following:
> intel/13.1.1.163
> mvapich2/1.9a2
> python/2.7.3-epd-7.3.2
> mpi4py/1.3
>
> The Amber (+AmberTools) installation was last updated on August 13th, 2013
> and has all bug fixes released up to that date.
> I made sure that there are more frames than cores, so that isn't the issue.
>
> The output from the job looks like this:
> TACC: Starting up job 1829723
> TACC: Setting up parallel environment for MVAPICH2+mpispawn.
> TACC: Starting parallel tasks...
> [cli_23]: aborting job:
> Fatal error in PMPI_Init_thread:
> Other MPI error, error stack:
> MPIR_Init_thread(436)...:
> MPID_Init(371)..........: channel initialization failed
> MPIDI_CH3_Init(285).....:
> MPIDI_CH3I_CM_Init(1106): Error initializing MVAPICH2 ptmalloc2 library
> ....
> [c464-404.stampede.tacc.utexas.edu:mpispawn_1][child_handler] MPI process
> (rank: 19, pid: 119854) exited with status 1
> ...
> [c437-002.stampede.tacc.utexas.edu:mpispawn_0][readline] Unexpected
> End-Of-File on file descriptor 12. MPI process died?
> [c437-002.stampede.tacc.utexas.edu:mpispawn_0][mtpmi_processops] Error
> while reading PMI socket. MPI process died?
> [c464-404.stampede.tacc.utexas.edu:mpispawn_1][child_handler] MPI process
> (rank: 17, pid: 119852) exited with status 1
> ...
> TACC: MPI job exited with code: 1
> TACC: Shutdown complete. Exiting.
>
This seems to be a limitation of mpi4py. I don't know that anybody has
gotten MMPBSA.py.MPI to run successfully on large numbers of cores (the
most I have ever tried is 48 cores, as reported in our paper). You can try
downloading and installing the latest mpi4py (version 1.3.1) to see whether
that fixes your problem, but short of switching to another parallelization
library that works on distributed clusters, there is not much we can do.
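As a quick, independent sanity check (a minimal sketch, not part of
MMPBSA.py), you could run a bare mpi4py "hello world" with the same launcher
and node count that fail for MMPBSA.py.MPI; if MPI initialization fails
there as well, the problem is in the mpi4py/MVAPICH2 stack rather than in
anything MMPBSA.py does:

# mpi_hello.py -- minimal mpi4py check, independent of MMPBSA.py.
# Launch it the same way you launch MMPBSA.py.MPI on the failing node
# count (e.g. through ibrun on Stampede).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Importing mpi4py.MPI is what triggers MPI initialization, so a failure
# like the one in your log would show up before this line ever executes.
print("Hello from rank %d of %d on %s"
      % (rank, size, MPI.Get_processor_name()))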
I would switch to a threading-based solution if I thought it offered any
advantage (indeed, I tried to design MMPBSA.py so that threads could be
adopted easily if I ever chose to try them), but I've never seen
MMPBSA.py.MPI have problems using every core on a single node through MPI,
and a threading approach would be SMP-only anyway.
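For illustration only, here is one possible SMP-only layout of that idea,
sketched with a multiprocessing worker pool rather than literal threads;
run_frame is an invented stand-in for the per-frame calculation and is not
part of the MMPBSA.py API:

# Hypothetical single-node alternative to MPI: divide the frames among
# local worker processes. This sketch uses multiprocessing (not MMPBSA.py
# code) purely to show why such a scheme is limited to one node.
from multiprocessing import Pool

def run_frame(frame_index):
    # ... the per-frame calculation would go here ...
    return frame_index  # placeholder result

if __name__ == "__main__":
    nframes = 500                # more frames than workers, as noted above
    pool = Pool(processes=16)    # one worker per core on this node
    results = pool.map(run_frame, range(nframes))
    pool.close()
    pool.join()
    # results holds one entry per frame, computed without any MPI layer,
    # which is exactly why this approach cannot span multiple nodes.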
All the best,
Jason
--
Jason M. Swails
BioMaPS,
Rutgers University
Postdoctoral Researcher
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Oct 01 2013 - 05:00:06 PDT