Re: AMBER: MPI Quiescence problem in REMD

From: In Hee Park <ipark.chemistry.ohio-state.edu>
Date: Thu, 16 Aug 2007 18:56:37 -0400 (EDT)

Dear All,

I just wanted to post an update on the MPI progress Quiescence Detection
problem during REMD runs that I posted about last month.

I simply used the mpirun option "-q 0" to disable quiescence detection,
which should give the replicas/CPUs as much time as they need to communicate
with each other. Now my REMD run consisting of 64 replicas on 64 CPUs for the
dimer system is working well; a sketch of the invocation is shown below.
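For reference, the launch command now looks roughly like the following sketch.
The group-file name and the path to sander.MPI are placeholders for my local
setup, and the sander.MPI flags shown are just the usual multisander/REMD
arguments; adjust them for your own Amber version and cluster.

# Disable the MPI quiescence check (-q 0) so that slow replicas are not
# killed after 900 s without visible MPI send/receive progress.
mpirun -q 0 -np 64 \
    $AMBERHOME/exe/sander.MPI -ng 64 -groupfile remd.groupfile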

Reference site:
http://www.fz-juelich.de/zam/JULI/userinfo/faq/quiescence/

_____________
In-Hee Park

[2007-07-11.Wed.1:08pm] Carlos Simmerling wrote `Re: AMBER: MPI Quiescence...'

  was the MD at these same temperatures? if not, perhaps you are getting
  shake failures etc. are you using more than 1 processor per replica?
  that can hide some error messages.

  On 7/11/07, In Hee Park <ipark.chemistry.ohio-state.edu> wrote:
> > does the same system perform well with MD?
> yes, I tested a single MD run for one of the 64 replicas used in REMD, and
> it worked fine without any problem.
>
>
> > did you equilibrate each temperature first (outside REMD)?
> yes
>
> > did you get any output in the mdout files? remlog?
> yes, I see normal output files (crd, rst, out, mdout for every replica)
> kept regularly updated up to some time step, but then the REMD run
> terminated with an `MPI Quiescence problem` message and an empty rem.log.
> ===
> MPIRUN: MPI progress Quiescence Detected.
> MPIRUN: 48 out of 64 ranks showed no MPI send or receive progress in 900 seconds.
> ===
> Now I notice, however, that only 16 replicas (out of 64) reached the
> final time step (I set NSTEP=1000), whereas the other 48 replicas
> stopped at an earlier step (e.g. NSTEP=800, 900, 950), which seems to be
> relevant to the MPI Quiescence message shown above.
>
> Do I need to extend the number of time steps for all the replicas?
> _____________
> In-Hee Park
>
> [2007-07-11.Wed.5:28am] Carlos Simmerling wrote `Re: AMBER: MPI Quiescence...'
>
> it's hard to guess what's going on.
> does the same system perform well with MD?
> did you equilibrate each temperature first (outside REMD)?
> did you get any output in the mdout files? remlog?
> there just isn't enough info to help except that it's likely
> some of the replicas have crashed. do they still show as
> running on the nodes?
>
> On 7/11/07, In Hee Park <ipark.chemistry.ohio-state.edu> wrote:
> > Dear Amber users,
> >
> > I would like to ask about the `MPI Quiescence problem` that I have
> > encountered during REMD.
> >
> > I was trying two sets of REMD runs, each consisting of 64 replicas,
> > for (1) a monomer and (2) a dimer of the protein system, using Amber 9
> > on an AMD/SuSE 10.1 cluster.
> >
> > For the (1) monomer system, the REMD worked well through the
> > temperature-exchange production run, as shown in the files attached to
> > this message: "monomer-REMD.result" and "monomer-REMD-pr-Texchange.out".
> >
> > In contrast to the monomer case, for the (2) dimer system REMD I
> > performed an additional step -- using the NMR restraint option -- to
> > prevent overflow of the rst files during heating to each of the 64
> > target temperatures, and I did indeed obtain non-overflowing rst files.
> >
> > However, the dimer REMD (temperature exchange) run still does not work,
> > and it ended with the following message:
> > ===
> > MPIRUN: MPI progress Quiescence Detected.
> > MPIRUN: 48 out of 64 ranks showed no MPI send or receive progress in 900
> > seconds.
> > ===
> >
> > In order to check whether this is an MPI communication problem with the
> > cluster itself or rather something in the dimer-REMD setup, I ran the
> > "monomer-REMD" again (because it had already been confirmed to work
> > well). It turned out that the monomer REMD ran fine as usual, so the MPI
> > system itself is not the problem; beyond that I have no clue, so I am
> > asking for your help on this issue.
> >
> > For your information, I have attached both the monomer and dimer REMD
> > results (showing how the temperature exchanges are performed) and both
> > runs' output messages.
> >
> > Has anyone encountered this kind of problem before? Thanks a lot.
> >
> > _____________
> > In-Hee Park
> >
> > [2007-06-27.Wed.4:14pm] Carlos Simmerling wrote `Re: AMBER: rst overflow...'
> >
> > I would suggest trying a distance restraint on the center of mass, but
> > using the NMR restraint option and setting the r2 distance short (1 A)
> > and r3 larger, say 100 A. That way it can move between these without
> > any penalty but not move farther than 100 A. I have to say, though,
> > that once the dimer dissociates you will have a hard time getting it
> > back. I am not aware of any studies using REMD on multiple chains
> > except for looking at oligomerization of short chains under periodic
> > boundary conditions.
> > I would check the literature to see the current state of the art for
> > figuring out protein-protein interaction; I don't think MD is the way
> > to go. If you know the interface and just want to optimize it, then
> > using shorter distances in the restraint to keep it from dissociating
> > would be better, but you'll have to go carefully and may have to try
> > many variations to find a protocol that works well. It all depends on
> > what you mean by "drastic" changes. I would consider it an unsolved
> > research problem.
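To make that concrete, here is a minimal sketch of such a flat-bottom
center-of-mass distance restraint in AMBER's NMR-restraint format. The atom
numbers in igr1/igr2 are placeholders for the atoms defining each monomer,
and the file name and force constants are illustrative assumptions, not
values from this thread.

# Flat-bottom COM distance restraint: no penalty between r2 = 1 A and
# r3 = 100 A, harmonic walls outside that range (rk2/rk3 in kcal/mol/A^2).
# iat = -1,-1 tells sander to restrain the centers of mass of the two
# atom groups listed in igr1 and igr2 (placeholder atom numbers below).
cat > com_restraint.RST << 'EOF'
&rst
  iat = -1, -1,
  igr1 = 1, 2, 3, 4,
  igr2 = 501, 502, 503, 504,
  r1 = 0.0, r2 = 1.0, r3 = 100.0, r4 = 110.0,
  rk2 = 10.0, rk3 = 10.0,
/
EOF

The restraint would then be activated by setting nmropt=1 in &cntrl and
pointing to the file with a DISANG=com_restraint.RST line (preceded by an
&wt type='END' / line) at the end of the mdin.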
> >
> > On 6/27/07, In Hee Park <ipark.chemistry.ohio-state.edu> wrote:
> > >
> > > Dr. Simmerling,
> > >
> > > Thanks for your critical help; your prediction was correct. The dimer
> > > equilibration to the target temperature ended up overflowing again,
> > > even with the "NSCM" option used.
> > >
> > > Could you give me more guidance on your suggestion of setting a
> > > restraint to keep the centers of mass from getting too far apart?
> > > Since I am interested in conformational changes around the dimer
> > > interface, which is at the center of mass, I am a bit hesitant to
> > > simply set up the typical group restraint around that interface.
> > >
> > > Since I am concerned with getting rather drastic conformational
> > > changes around the interface, would LES of the dimer be a better
> > > approach?
> > >
> > > Thanks for your help.
> > >
> > > _____________
> > > In-Hee Park
> > >
> > > [2007-06-26.Tue.11:35am] Carlos Simmerling wrote `Re: AMBER: rst overflow...'
> > >
> > > with a dimer I am not sure if that is correct: even if the CM of the
> > > system (the dimer) stays at the origin, the monomers may drift far
> > > apart. in essence, what you are simulating is two monomers at infinite
> > > dilution. you probably should set a restraint to keep the centers of
> > > mass from getting too far apart, or write some code to keep the
> > > monomers inside a virtual box.
> > >
> > > On 6/26/07, David A. Case <case.scripps.edu> wrote:
> > > >
> > > > On Tue, Jun 26, 2007, In Hee Park wrote:
> > > > >
> > > > > Although setting "iwrap=1" is recommended to keep the coordinate
> > > > > output from overflowing the trajectory file format, this option can
> > > > > be used for PME runs only. Is shifting to explicit (or hybrid) REMD
> > > > > now the only possible way to make my dimer REMD work? Is there no
> > > > > way to resolve the overflow problem under GB?
> > > > >
> > > >
> > > > I think the nscm option can be used to do what you want for GB runs.
> > > >
> > > > ...dac
> > > >
> > > >
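For illustration, a minimal GB mdin along those lines might look like the
sketch below; every parameter value here is an illustrative assumption rather
than a recommendation. The relevant point is nscm, which periodically removes
center-of-mass translation and rotation in a non-periodic (ntb=0) GB run,
since iwrap applies only to periodic PME simulations.

# Sketch of a GB production-MD input using nscm for COM motion removal.
cat > gb_md.in << 'EOF'
 GB production MD with periodic removal of COM motion (nscm)
 &cntrl
   imin = 0, irest = 1, ntx = 5,
   igb = 5, ntb = 0, cut = 999.0,
   ntt = 3, gamma_ln = 1.0, temp0 = 300.0,
   nstlim = 500000, dt = 0.002, ntc = 2, ntf = 2,
   nscm = 1000,
   ntpr = 500, ntwx = 500,
 /
EOF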
> > > -----------------------------------------------------------------------
> > > > The AMBER Mail Reflector
> > > > To post, send mail to amber.scripps.edu
> > > > To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
> > > >
> > >
> > >
> -----------------------------------------------------------------------
> The AMBER Mail Reflector
> To post, send mail to amber.scripps.edu
> To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
>
> -----------------------------------------------------------------------
> The AMBER Mail Reflector
> To post, send mail to amber.scripps.edu
> To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
>


  --
  ===================================================================
  Carlos L. Simmerling, Ph.D.
  Associate Professor Phone: (631) 632-1336
  Center for Structural Biology Fax: (631) 632-1555
  CMM Bldg, Room G80
  Stony Brook University E-mail: carlos.simmerling.gmail.com
  Stony Brook, NY 11794-5115 Web: http://comp.chem.sunysb.edu
  ===================================================================

-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
Received on Sun Aug 19 2007 - 06:07:31 PDT