Re: [AMBER] MMPBSA.py MPI problem on TACC Stampede

From: Sorensen, Jesper <jesorensen.ucsd.edu>
Date: Tue, 1 Oct 2013 17:19:11 +0000

Hi Jason,

Thanks for the reply. We'll probably just stick to 64 cores for now. That still does a nice job.

Best,
Jesper


On Oct 1, 2013, at 4:36 AM, Jason Swails <jason.swails.gmail.com> wrote:

> On Mon, Sep 30, 2013 at 8:55 PM, Sorensen, Jesper <jesorensen.ucsd.edu> wrote:
>
>> Hello all,
>>
>> I've been running MMPBSA.py jobs on the XSEDE resource TACC Stampede. The
>> MPI implementation works perfectly up to 64 cores (4 nodes), but when I
>> move to 5 nodes I get the MPI error shown below. I realize you are not
>> responsible for the TACC resources, but the admins seemed puzzled by the
>> errors and didn't know how to fix the issue, so I am hoping you have some
>> suggestions.
>>
>> Amber was compiled using the following:
>> intel/13.1.1.163
>> mvapich2/1.9a2
>> python/2.7.3-epd-7.3.2
>> mpi4py/1.3
>>
>> The Amber (+AmberTools) installation was last updated on August 13th, 2013,
>> and has all bug fixes applied up to that date.
>> I made sure that there are more frames than cores, so that isn't the issue.
>>
>> The output from the job looks like this:
>> TACC: Starting up job 1829723
>> TACC: Setting up parallel environment for MVAPICH2+mpispawn.
>> TACC: Starting parallel tasks...
>> [cli_23]: aborting job:
>> Fatal error in PMPI_Init_thread:
>> Other MPI error, error stack:
>> MPIR_Init_thread(436)...:
>> MPID_Init(371)..........: channel initialization failed
>> MPIDI_CH3_Init(285).....:
>> MPIDI_CH3I_CM_Init(1106): Error initializing MVAPICH2 ptmalloc2 library
>> ....
>> [c464-404.stampede.tacc.utexas.edu:mpispawn_1][child_handler] MPI process
>> (rank: 19, pid: 119854) exited with status 1
>> ...
>> [c437-002.stampede.tacc.utexas.edu:mpispawn_0][readline] Unexpected
>> End-Of-File on file descriptor 12. MPI process died?
>> [c437-002.stampede.tacc.utexas.edu:mpispawn_0][mtpmi_processops] Error
>> while reading PMI socket. MPI process died?
>> [c464-404.stampede.tacc.utexas.edu:mpispawn_1][child_handler] MPI process
>> (rank: 17, pid: 119852) exited with status 1
>> ...
>> TACC: MPI job exited with code: 1
>> TACC: Shutdown complete. Exiting.
>>
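
In case anyone else hits this wall: a quick way to see whether the ptmalloc2
failure in the traceback above comes from the MVAPICH2/mpi4py stack itself,
rather than from MMPBSA.py.MPI, is a bare mpi4py hello-world run at the same
core count. This is only a sketch; the file name mpi_hello.py and the 5-node
run are assumptions, and it would need to be launched under the same modules
(intel/13.1.1.163, mvapich2/1.9a2, python/2.7.3-epd-7.3.2, mpi4py/1.3) with
TACC's usual ibrun launcher.

  # mpi_hello.py -- hypothetical standalone check, not part of MMPBSA.py.
  # Importing mpi4py.MPI triggers MPI_Init_thread, which is the call that
  # fails with the ptmalloc2 error in the traceback above.
  from mpi4py import MPI

  comm = MPI.COMM_WORLD
  print("rank %d of %d on %s" % (comm.Get_rank(), comm.Get_size(),
                                 MPI.Get_processor_name()))

If this also dies with the same ptmalloc2 error on 5 nodes, the problem is in
the MPI stack rather than in MMPBSA.py.
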
>
> This seems to be a limitation of mpi4py. I don't know that anybody has
> gotten MMPBSA.py.MPI to run successfully on large numbers of cores (the
> most I've ever tried was 48 cores as reported in our paper). You can try
> downloading and installing the latest mpi4py (version 1.3.1) and seeing if
> that fixes your problem, but short of switching to another parallelization
> library (that works on distributed clusters) there is not much we can do.
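
For what it's worth, a short check like the following (illustrative only)
confirms which mpi4py build a given python actually picks up, which is worth
doing after installing 1.3.1 since a module-provided mpi4py can shadow a
user-local one. The get_config() call, if the installed version provides it,
also reports which MPI the module was built against.

  # Illustrative check of which mpi4py a given python actually picks up;
  # run it with the same python that launches MMPBSA.py.MPI.
  import mpi4py
  print(mpi4py.__version__)    # should read 1.3.1 after the upgrade
  print(mpi4py.get_config())   # build configuration (mpicc used, etc.)
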
>
> I would switch to a threading-based solution if I thought it offered any
> advantage (indeed, I tried to design MMPBSA.py so that threads could be
> adopted easily if I ever chose to try them), but I've never seen
> MMPBSA.py.MPI have problems using every core on a node through MPI [and the
> threading approach is SMP-only].
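
Just to make the SMP-only point concrete: a process-pool version along these
lines (only a sketch, with a made-up energy_of_frame function standing in for
the real per-frame work) is bound to the cores of a single node, which is
exactly why the MPI route is the one that matters on a cluster.

  # Illustrative only -- not code from MMPBSA.py.  A multiprocessing pool
  # can spread frames over the cores of one node, but never across nodes.
  from multiprocessing import Pool

  def energy_of_frame(frame):
      # hypothetical stand-in for the real per-frame energy calculation
      return frame * frame

  if __name__ == "__main__":
      nframes, ncores = 200, 16          # 16 cores = one Stampede node
      pool = Pool(processes=ncores)
      energies = pool.map(energy_of_frame, range(nframes))
      pool.close()
      pool.join()
      print(sum(energies))
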
>
> All the best,
> Jason
>
> --
> Jason M. Swails
> BioMaPS,
> Rutgers University
> Postdoctoral Researcher
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Oct 01 2013 - 10:30:04 PDT