Re: [AMBER] BUG and FIX: pmemd crashes when vlimit exceeded. .

From: Robert Duke <rduke.email.unc.edu>
Date: Thu, 12 Aug 2010 10:03:53 -0400

I have thought about this a little more, and I have vague recollections of
deciding to let this message print on all nodes to stdout so that there is
some indication of a simulation going berserk, if you will (i.e., you have
velocities in need of adjustment - if this starts happening all over the
place, generally something is not so good). I believe that all these
processes will have stdin, stdout, and stderr connected to something, so in
my experience you don't get into trouble doing this; it is just sort of
messy.

If we really wanted to handle this properly, we would flow the info back to
the master so it could be reported in mdout, but that possibly has
performance implications (I could actually make this essentially cost-free
with a little effort, I believe - I chose to capture the vlimit problems in
the cruder fashion, though).

A simple test to see if actually doing a nonmaster write is causing a
problem is to pull a write statement out of all the conditionals at this
point in the code and just have all processes blast the max velocity
observed to their stdout. In my experience, this does not cause a problem;
I need to plow through some MPI standards to see if stdin/stdout/stderr
handling is actually specified. You don't want to go writing other files
from any node, but I think using stdout/stderr for debugging/error
reporting, while messy, is safe. I recollect making the decision to do this
based on the annoying problem with shake terminating a run under some
conditions with no error output at all.

Anyway, if you want, please just blast some stuff to stdout from all
processes to confirm that your exact hw/sw can handle this. One never knows
for sure about exact implementations, but this code has been running like
this since Amber 9, and allowing this was a conscious decision; I guess I
should check out the specs and update the comment :-)
Regards - Bob Duke
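
The two approaches discussed above (every process writing the observed
maximum velocity to its own stdout, versus flowing the value back to the
master for a single report) can be sketched as a small stand-alone MPI
program. This is only an illustrative sketch and not pmemd source: the
program name, the variables vmax_local and vmax_global, and the stand-in
velocity values are all hypothetical, and the master's report goes to
stdout here rather than to a real mdout unit.

```fortran
! Illustrative sketch only; NOT taken from pmemd. All names and the
! stand-in velocity values below are hypothetical.
program vlimit_report_sketch
  use mpi
  use, intrinsic :: iso_fortran_env, only: output_unit
  implicit none

  integer, parameter :: master_rank = 0
  double precision, parameter :: vlimit = 20.0d0  ! default quoted in the thread
  integer :: ierr, rank, nstep
  double precision :: vmax_local, vmax_global
  logical :: master

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  master = (rank == master_rank)

  nstep = 1
  ! Stand-in for the largest velocity seen on this rank during the step.
  vmax_local = 15.0d0 + 2.0d0 * dble(rank)

  ! Approach 1: unconditional write from every rank to its own stdout.
  ! If this alone brings the job down, non-master writes are the problem
  ! on that particular MPI installation.
  write(output_unit, '(a,i0,a,f10.4)') 'rank ', rank, &
    ': local vmax = ', vmax_local

  ! Approach 2: send the maximum back to the master, which is then the
  ! only rank that reports. This costs one collective call per check.
  call MPI_Reduce(vmax_local, vmax_global, 1, MPI_DOUBLE_PRECISION, &
                  MPI_MAX, master_rank, MPI_COMM_WORLD, ierr)

  if (master) then
    if (vmax_global > vlimit) then
      write(output_unit, '(a,i0,a,f10.4)') 'vlimit exceeded for step ', &
        nstep, '; vmax = ', vmax_global
    end if
  end if

  call MPI_Finalize(ierr)
end program vlimit_report_sketch
```

With the stand-in values used here, the reduced maximum only passes 20.0
when the program is started on four or more ranks. The compiler wrapper
and launcher names (often mpif90 and mpirun) depend on the MPI
installation, so treat those as assumptions too.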
----- Original Message -----
From: <Don.Bashford.stjude.org>
To: "AMBER Mailing List" <amber.ambermd.org>
Sent: Wednesday, August 11, 2010 5:29 PM
Subject: Re: [AMBER] BUG and FIX: pmemd crashes when vlimit exceeded. .


>A few things:
>
> Many years ago, I wrote an MPI program and made the mistake of letting
> all the nodes write to stdout (or was it stderr) and the result was
> just chaotic output files with interleaved output from different
> nodes. But I've recently noticed that seemingly minor tweaks that our
> system managers make to the cluster can have big impacts on the
> performance of MPI programs.
>
> Come to think of it, it's pretty weird that a write to mdout, which I
> had assigned to a specific *.out file, ends up in standard output.
> (In this case stdout is some file like prod1.o411897, which the Sun
> Grid Engine queue system assigns.)
>
> I can't be sure it was either the vlimit, or the attempt to warn about
> it that caused the crash. All I have is the proximity in time, and
> oddity in the source code.
>
> I'm assuming that the code snippet I sent was relevant because that's
> the only code I found in src/pmemd/src when I grep for 'vlimit
> exceeded'
>
> About why I got the vlimit in the first place, I don't know since I'm
> following a similar protocol for minimization/equilibration as I did
> before. But I've gone back to the minimization stage to do more
> unrestrained minimization and now I'm back up in production with 5 ns+
> done so far and no problems.
>
> By the way, in sander's runmd.f, the warning call IS protected by an
> "if (master) then" statement.
>
> At Tue, 10 Aug 2010 20:15:32 -0500,
> Robert Duke wrote:
>>
>> Hmmm, not sure exactly what is going on here, but typically (or at least
>> in my experience), you can do a write to stdout on other nodes, BUT it
>> won't end up back in the mdout, as that is the master's stdout. You
>> definitely see, and would want to see, stderr from non-master nodes. I
>> don't know if this is a code modification bug, or if the intent of the
>> comment was to indicate that other writes just won't end up in the
>> "master" stdout - mdout (but I'll look into it). So I guess there may be
>> implementations of mpi and/or specific systems that could have problems
>> with this - but it has been okay on the sort of systems I have run on. I
>> am wondering if you have something else going on to be generating such
>> big vlimits though (even though you have run > 1 nsec) that is causing
>> one of your nodes to error exit. Shake is one place this sort of thing
>> can happen - you can have excessive movement and shake won't be able to
>> converge, so it kills the whole job. I could be completely off base here
>> though; just trying to think of things (in my experience, I have seen
>> shake kill things on somewhat unstable systems, and I have seen mpi
>> hardware errors - connectivity or what-have-you bring things down). You
>> might take your last restart and single step into this (print out each
>> step) and see what you see. Of course, you can also just stub out the
>> reporting for nonmaster nodes, and that will tell you if it is a "thou
>> shalt not write to anything on a nonmaster node error".
>> Regards - Bob Duke
>> ----- Original Message -----
>> From: <Don.Bashford.stjude.org>
>> To: <amber.ambermd.org>
>> Sent: Tuesday, August 10, 2010 8:46 PM
>> Subject: [AMBER] BUG and FIX: pmemd crashes when vlimit exceeded
>>
>>
>> >I was running pmemd from Amber10 under MPI on 16 processors and it
>> > crashed with messages to stderr like:
>> >
>> > MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
>> > with errorcode 1
>> >
>> > shortly after emitting the warning to stdout:
>> >
>> > vlimit exceeded for step 747615; vmax = 39.4731
>> >
>> > vlimit had the default value (20.0) after over 1 ns of production, and
>> > this was the only vlimit warning. It looks to me like the problem
>> > comes from around line 733 in amber11/src/pmemd/src/runmd.fpp:
>> >
>> > ! Only violations on the master node are actually reported
>> > ! to avoid both MPI communication and non-master writes.
>> > write(mdout, '(a,i6,a,f10.4)') 'vlimit exceeded for step ', nstep, &
>> >   '; vmax = ', vmax
>> >
>> > Although the comment says only the master will report, I don't see any
>> > code to actually enforce that. Elsewhere in runmd.fpp, writes to mdout
>> > are protected by an "if (master) then .... end if" production
>> > immediately around the write statement. So I assume the fix is just to
>> > do that here also.
>> >
>> > I don't know much about MPI. Is it usual for an MPI application to
>> > crash if a non-master tries to write? Is this dependent on your MPI
>> > implementation/environment?
>> >
>> > I experienced this problem in Amber10 with patches up to bugfix 30.
>> > The more recent bugfixes don't seem to cover it, and the problem seems
>> > to still be there in the Amber11 source.
>> >
>> > Don Bashford
>> > Department of Structural Biology
>> > Saint Jude Children's Research Hospital
>> > Memphis, TN
>> >
>> > Email Disclaimer: www.stjude.org/emaildisclaimer
>> >
>> >
>> > _______________________________________________
>> > AMBER mailing list
>> > AMBER.ambermd.org
>> > http://lists.ambermd.org/mailman/listinfo/amber
>> >
>> >
>>
>>
>
>
>


Received on Thu Aug 12 2010 - 07:30:04 PDT