Re: [AMBER] BUG and FIX: pmemd crashes when vlimit exceeded

From: Robert Duke <rduke.email.unc.edu>
Date: Tue, 10 Aug 2010 21:15:32 -0400

Hmmm, not sure exactly what is going on here, but typically (or at least in
my experience), you can do a write to stdout on other nodes, BUT it won't
end up back in the mdout, as that is the master's stdout. You definitely
see, and would want to see, stderr from non-master nodes. I don't know if
this is a code modification bug, or if the intent of the comment was to
indicate that other writes just won't end up in the "master" stdout - mdout
(but I'll look into it). So I guess there may be implementations of MPI
and/or specific systems that could have problems with this - but it has been
okay on the sort of systems I have run on. I am wondering, though, whether
something else is going on that produces such a large vmax (even though you
have run > 1 nsec) and is causing one of your nodes to error exit. Shake is
one place this sort of thing can happen - you can have
excessive movement and shake won't be able to converge, so it kills the
whole job. I could be completely off base here though; just trying to think
of things (in my experience, I have seen shake kill things on somewhat
unstable systems, and I have seen MPI hardware errors - connectivity or
what-have-you - bring things down). You might take your last restart and
single step into this (print out each step) and see what you see. Of
course, you can also just stub out the reporting for nonmaster nodes, and
that will tell you if it is a "thou shalt not write to anything on a
nonmaster node" error.
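
A minimal sketch of that stub, for what it is worth. It is a stand-alone toy
program, not the actual runmd.fpp code: master, mdout, nstep and vmax are the
names from the quoted message, while report_vlimit, vlimit_guard_demo and the
use of unit 6 for mdout are assumptions made only for illustration.

```fortran
! Sketch only: the "if (master)" guard isolated in a small stand-alone program.
! report_vlimit and vlimit_guard_demo are hypothetical names, not pmemd code.
program vlimit_guard_demo
  implicit none
  integer, parameter :: mdout = 6   ! assumption: stdout stands in for mdout

  ! Pretend to be the master first, then a non-master rank.
  call report_vlimit(.true.,  747615, 39.4731d0)   ! writes the warning
  call report_vlimit(.false., 747615, 39.4731d0)   ! writes nothing

contains

  subroutine report_vlimit(master, nstep, vmax)
    logical, intent(in)          :: master
    integer, intent(in)          :: nstep
    double precision, intent(in) :: vmax

    ! Only the master writes to mdout; every other rank skips the report.
    if (master) then
      write(mdout, '(a,i6,a,f10.4)') 'vlimit exceeded for step ', nstep, &
            '; vmax = ', vmax
    end if
  end subroutine report_vlimit

end program vlimit_guard_demo
```

If the crash goes away with the guard in place, that points at the nonmaster
write; if it does not, something else is bringing the job down.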
Regards - Bob Duke
----- Original Message -----
From: <Don.Bashford.stjude.org>
To: <amber.ambermd.org>
Sent: Tuesday, August 10, 2010 8:46 PM
Subject: [AMBER] BUG and FIX: pmemd crashes when vlimit exceeded


> I was running pmemd from Amber10 under MPI on 16 processors and it
> crashed with messages to stderr like:
>
> MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
> with errorcode 1
>
> shortly after emitting the warning to stdout:
>
> vlimit exceeded for step 747615; vmax = 39.4731
>
> vlimit had the default value (20.0). This happened after over 1 ns of
> production, and it was the only vlimit warning. It looks to me like the
> problem comes from around line 733 in amber11/src/pmemd/src/runmd.fpp:
>
> ! Only violations on the master node are actually reported
> ! to avoid both MPI communication and non-master writes.
> write(mdout, '(a,i6,a,f10.4)') 'vlimit exceeded for step ', nstep, &
>       '; vmax = ', vmax
>
> Although the comment says only the master will report, I don't see any
> code to actually enforce that. Elsewhere in runmd.fpp, writes to mdout
> are protected by an "if (master) then .... end if" construct immediately
> around the write statement. So I assume the fix is just to do that
> here also.
>
> I don't know much about MPI. Is it usual for an MPI application to
> crash if a non-master tries to write? Is this dependent on your MPI
> implementation/environment?
>
> I experienced this problem in Amber10 with patches up to bugfix 30.
> The more recent bugfixes don't seem to cover it, and the problem seems
> to still be there in the Amber11 source.
>
> Don Bashford
> Department of Structural Biology
> Saint Jude Children's Research Hospital
> Memphis, TN
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>
>

Received on Tue Aug 10 2010 - 18:30:06 PDT