A few things:
Many years ago, I wrote an MPI program and made the mistake of letting
all the nodes write to stdout (or was it stderr?), and the result was
chaotic output files with interleaved output from different nodes (a
sketch of the rank-guarded write I should have used follows below).
I've also noticed recently that seemingly minor tweaks our system
managers make to the cluster can have big impacts on the performance of
MPI programs.
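For what it's worth, the guard I should have used back then is the
standard rank check. A minimal, generic sketch (nothing taken from any
Amber code, names are my own):

    program rank0_only
      use mpi
      implicit none
      integer :: ierr, rank
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      ! only rank 0 talks to stdout; the other ranks stay quiet
      if (rank == 0) then
         write(*, '(a)') 'rank 0 reporting; other ranks are silent'
      end if
      call MPI_Finalize(ierr)
    end program rank0_only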
Come to think of it, it's pretty weird that a write to mdout, which I
had assigned to a specific *.out file, ends up in standard output. (In
this case stdout is a file like prod1.o411897, which the Sun Grid
Engine queue system assigns.)
I can't be sure whether it was the vlimit itself or the attempt to warn
about it that caused the crash. All I have to go on is the proximity in
time and the oddity in the source code.
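If it does happen again, I could take the last restart and single-step
through it as you suggest below. I imagine the mdin would look roughly
like the following, with ntpr = 1 so every step gets printed; the
&cntrl values here are just placeholders for my setup, not something I
have actually run:

    single-step debug run from last restart
     &cntrl
       imin = 0, irest = 1, ntx = 5,
       nstlim = 100, dt = 0.002,
       ntpr = 1, ntwx = 1, ntwr = 1,
       ntc = 2, ntf = 2, cut = 8.0,
       ntt = 1, temp0 = 300.0, tautp = 1.0,
       vlimit = 20.0,
     /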
I'm assuming that the code snippet I sent is the relevant one, because
it's the only match I found in src/pmemd/src when I grepped for 'vlimit
exceeded'.
As for why I got the vlimit warning in the first place, I don't know,
since I'm following a minimization/equilibration protocol similar to
what I used before. But I've gone back to the minimization stage to do
more unrestrained minimization, and I'm now back in production with 5+
ns done so far and no problems.
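For the record, what I think pmemd's runmd.fpp needs around that write
is just the usual master guard. A sketch only (I have not rebuilt pmemd
to check that this compiles):

    ! report the violation from the master rank only, so non-master
    ! ranks never touch the mdout unit
    if (master) then
      write(mdout, '(a,i6,a,f10.4)') 'vlimit exceeded for step ', nstep, &
            '; vmax = ', vmax
    end if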
By the way, in sander's runmd.f, the warning call IS protected by an
"if (master) then" statement.
At Tue, 10 Aug 2010 20:15:32 -0500,
Robert Duke wrote:
>
> Hmmm, not sure exactly what is going on here, but typically (or at least in
> my experience), you can do a write to stdout on other nodes, BUT it won't
> end up back in the mdout, as that is the master's stdout. You definitely
> see, and would want to see, stderr from non-master nodes. I don't know if
> this is a code modification bug, or if the intent of the comment was to
> indicate that other writes just won't end up in the "master" stdout - mdout
> (but I'll look into it). So I guess there may be implementations of mpi
> and/or specific systems that could have problems with this - but it has been
> okay on the sort of systems I have run on. I am wondering, though, if you
> have something else going on that is generating such big vlimits (even
> though you have run > 1 nsec) and is causing one of your nodes to error
> exit. Shake is one place this sort of thing can happen - you can have
> excessive movement and shake won't be able to converge, so it kills the
> whole job. I could be completely off base here though; just trying to think
> of things (in my experience, I have seen shake kill things on somewhat
> unstable systems, and I have seen mpi hardware errors - connectivity or
> what-have-you bring things down). You might take your last restart and
> single step into this (print out each step) and see what you see. Of
> course, you can also just stub out the reporting for nonmaster nodes, and
> that will tell you if it is a "thou shalt not write to anything on a
> nonmaster node" error.
> Regards - Bob Duke
> ----- Original Message -----
> From: <Don.Bashford.stjude.org>
> To: <amber.ambermd.org>
> Sent: Tuesday, August 10, 2010 8:46 PM
> Subject: [AMBER] BUG and FIX: pmemd crashes when vlimit exceeded
>
>
> > I was running pmemd from Amber10 under MPI on 16 processors and it
> > crashed with messages to stderr like:
> >
> > MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
> > with errorcode 1
> >
> > shortly after emitting the warning to stdout:
> >
> > vlimit exceeded for step 747615; vmax = 39.4731
> >
> > vlimit had the default value (20.0) after over 1 ns of production, and
> > this was the only vlimit warning. It looks to me like the problem
> > comes from around line 733 in amber11/src/pmemd/src/runmd.fpp:
> >
> > ! Only violations on the master node are actually reported
> > ! to avoid both MPI communication and non-master writes.
> >   write(mdout, '(a,i6,a,f10.4)') 'vlimit exceeded for step ', nstep, &
> >         '; vmax = ', vmax
> >
> > Although the comment says only the master will report, I don't see any
> > code to actually enforce that. Elsewhere in runmd.fpp, writes to mdout
> > are
> > protected by an "if (master) then .... end if" construct immediately
> > around the write statement. So I assume the fix is just to do that
> > here also.
> >
> > I don't know much about MPI. Is it usual for an MPI application to
> > crash if a non-master tries to write? Is this dependent on your MPI
> > implementation/environment?
> >
> > I experienced this problem in Amber10 with patches up to bugfix 30.
> > The more recent bugfixes don't seem to cover it, and the problem seems
> > to still be there in the Amber11 source.
> >
> > Don Bashford
> > Department of Structural Biology
> > Saint Jude Children's Research Hospital
> > Memphis, TN
> >
> >
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Aug 11 2010 - 15:00:03 PDT