Re: [AMBER] BUG and FIX: pmemd crashes when vlimit exceeded

From: Jason Swails <jason.swails.gmail.com>
Date: Wed, 11 Aug 2010 17:45:41 -0400

On Wed, Aug 11, 2010 at 5:29 PM, <Don.Bashford.stjude.org> wrote:

> A few things:
>
> Many years ago, I wrote an MPI program and made the mistake of letting
> all the nodes write to stdout (or was it stderr?), and the result was
> just chaotic output files with interleaved output from different
> nodes. But I've recently noticed that seemingly minor tweaks that our
> system managers make to the cluster can have big impacts on the
> performance of MPI programs.
>
> Come to think of it, it's pretty weird that a write to mdout, which I
> had assigned to a specific *.out file, ends up in standard output.
> (In this case stdout is some file like prod1.o411897, which the Sun
> Grid Engine queue system assigns.)
>

The mdout file is given unit 6, which is stdout in Fortran. However, I
believe this assignment is only done on the master node (more
precisely, only the master has the mdout file opened on unit 6). That
is why a "write(*,*) blah" unprotected by an if (master) prints once to
the mdout file and nthread - 1 times to the screen (or to
prod1.o411897, or wherever your stdout stream is directed). Look at
file_io_dat.fpp, master_setup.fpp, and pmemd.fpp, and you will see why
this happens.
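
To make this concrete, here is a minimal, self-contained sketch (not
the actual pmemd source; the rank handling is stubbed out and the file
name is invented) of the unit-6 arrangement, together with the
if (master) guard that sander uses and that Don proposes as the fix:

    program mdout_sketch
       implicit none
       integer, parameter :: mdout = 6   ! unit 6 doubles as stdout in Fortran
       logical :: master
       integer :: rank                   ! would come from MPI_Comm_rank

       rank = 0                          ! stand-in; set via MPI in a real run
       master = (rank == 0)

       ! Only the master reassigns unit 6 to the mdout file; on every
       ! other rank, unit 6 remains the process's stdout stream.
       if (master) open(unit=mdout, file='prod1.out', status='replace')

       ! Unguarded write: lands in prod1.out on the master, but goes to
       ! each non-master rank's stdout (e.g. the SGE prod1.o411897 file).
       write(mdout, '(a)') 'unguarded message'

       ! The guarded form, as in sander's runmd.f and the proposed fix:
       if (master) then
          write(mdout, '(a,i6,a,f10.4)') 'vlimit exceeded for step ', &
                                         747615, '; vmax = ', 39.4731
       end if
    end program mdout_sketch

Run on more than one rank, the unguarded write shows up once in the
mdout file and nthread - 1 times in the stdout stream, which is exactly
the behavior described above.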


> I can't be sure it was either the vlimit, or the attempt to warn about
> it that caused the crash. All I have is the proximity in time, and
> oddity in the source code.
>
> I'm assuming that the code snippet I sent was relevant because it's
> the only code I found in src/pmemd/src when I grepped for 'vlimit
> exceeded'.
>

What did stderr say, if anything?


>
> About why I got the vlimit warning in the first place, I don't know,
> since I'm following a minimization/equilibration protocol similar to
> the one I used before. But I've gone back to the minimization stage to
> do more unrestrained minimization, and now I'm back up in production
> with 5 ns+ done so far and no problems.
>

Seems fairly irrelevant now, then...


All the best,
Jason


> By the way, in sander's runmd.f, the warning call IS protected by an
> "if (master) then" statement.
>
> At Tue, 10 Aug 2010 20:15:32 -0500,
> Robert Duke wrote:
> >
> > Hmmm, not sure exactly what is going on here, but typically (or at
> > least in my experience), you can do a write to stdout on other nodes,
> > BUT it won't end up back in the mdout, as that is the master's stdout.
> > You definitely see, and would want to see, stderr from non-master
> > nodes. I don't know if this is a code modification bug, or if the
> > intent of the comment was to indicate that other writes just won't
> > end up in the "master" stdout - mdout (but I'll look into it). So I
> > guess there may be implementations of mpi and/or specific systems
> > that could have problems with this - but it has been okay on the sort
> > of systems I have run on. I am wondering if you have something else
> > going on to be generating such big vlimits though (even though you
> > have run > 1 nsec) that is causing one of your nodes to error exit.
> > Shake is one place this sort of thing can happen - you can have
> > excessive movement and shake won't be able to converge, so it kills
> > the whole job. I could be completely off base here though; just
> > trying to think of things (in my experience, I have seen shake kill
> > things on somewhat unstable systems, and I have seen mpi hardware
> > errors - connectivity or what-have-you bring things down). You might
> > take your last restart and single step into this (print out each
> > step) and see what you see. Of course, you can also just stub out the
> > reporting for nonmaster nodes, and that will tell you if it is a
> > "thou shalt not write to anything on a nonmaster node" error.
> > Regards - Bob Duke
> > ----- Original Message -----
> > From: <Don.Bashford.stjude.org>
> > To: <amber.ambermd.org>
> > Sent: Tuesday, August 10, 2010 8:46 PM
> > Subject: [AMBER] BUG and FIX: pmemd crashes when vlimit exceeded
> >
> >
> > > I was running pmemd from Amber10 under MPI on 16 processors and it
> > > crashed with messages to stderr like:
> > >
> > > MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
> > > with errorcode 1
> > >
> > > shortly after emitting the warning to stdout:
> > >
> > > vlimit exceeded for step 747615; vmax = 39.4731
> > >
> > > vlimit had the default value (20.0) after over 1 ns of production, and
> > > this was the only vlimit warning. It looks to me like the problem
> > > comes from around line 733 in amber11/src/pmemd/src/runmd.fpp:
> > >
> > >   ! Only violations on the master node are actually reported
> > >   ! to avoid both MPI communication and non-master writes.
> > >   write(mdout, '(a,i6,a,f10.4)') 'vlimit exceeded for step ', nstep, &
> > >                                  '; vmax = ', vmax
> > >
> > > Although the comment says only the master will report, I don't see
> > > any code to actually enforce that. Elsewhere in runmd.fpp, writes
> > > to mdout are protected by an "if (master) then ... end if" construct
> > > immediately around the write statement. So I assume the fix is just
> > > to do that here also.
> > >
> > > I don't know much about MPI. Is it usual for an MPI application to
> > > crash if a non-master tries to write? Is this dependent on your MPI
> > > implementation/environment?
> > >
> > > I experienced this problem in Amber10 with patches up to bugfix 30.
> > > The more recent bugfixes don't seem to cover it, and the problem seems
> > > to still be there in the Amber11 source.
> > >
> > > Don Bashford
> > > Department of Structural Biology
> > > Saint Jude Children's Research Hospital
> > > Memphis, TN



-- 
Jason M. Swails
Quantum Theory Project,
University of Florida
Ph.D. Graduate Student
352-392-4032
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Aug 11 2010 - 15:00:04 PDT