Re: [AMBER] NaN with pmemd.cuda

From: Joseph Baker <bakerj.tcnj.edu>
Date: Thu, 6 Aug 2015 14:59:32 -0400

Hi Ross,

So I should also mention that the simulation still runs to the end and
finishes, even though there are NaN and *** entries for some energy
terms (and the NaNs show up directly in the coordinates if you output a
frame as rst7, for example).

The NVT simulation uses scaledMD, and its box stays steady at
26.842 x 26.842 x 26.842 A. The cutoff is 8 A.

The NPT simulation is just standard unbiased MD, and it uses the Monte
Carlo barostat. I've attached the boxinfo.dat file (sampled every 2 ps
over the 90 ns of simulation, up to the end of the chunk in which the
NaN was observed). You can see that the box lengths all stay large
enough that the 8 A cutoff should not be a problem.
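
For what it's worth, here is the quick sanity check I ran against
boxinfo.dat (a sketch only -- it assumes each data line is a time
followed by the three box lengths, and takes skinnb = 2 A, which I
believe is the pmemd default, for the ~2 x (cut + skinnb) limit Ross
describes below):

```python
# Flag boxinfo.dat frames whose shortest box edge dips below the
# ~2 * (cut + skinnb) GPU pairlist heuristic.
# Assumptions: cut = 8 A (our runs); skinnb = 2 A (believed pmemd
# default); each data line is "time a b c" with box lengths in
# columns 2-4.

CUT = 8.0
SKINNB = 2.0
LIMIT = 2.0 * (CUT + SKINNB)  # 20 A here

def flag_small_boxes(lines, limit=LIMIT):
    """Return (line_number, shortest_edge) for frames below the limit."""
    flagged = []
    for i, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or header lines
        shortest = min(float(x) for x in fields[1:4])
        if shortest < limit:
            flagged.append((i, shortest))
    return flagged
```

Running it as flag_small_boxes(open("boxinfo.dat")) flags nothing for
us -- every edge stays above 20 A -- which is why these boxes don't
look like the shrinking-box failure mode.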

We are currently running boxes with 100, 400, 800, 1200, 1600, 2000, and
3000 waters to see if there is some correlation between NaN frequency and
the number of waters/box size. We'll hopefully have data on that soon as
well.

No rush at all in terms of when this can be looked into.

Joe


--
Joseph Baker, PhD
Assistant Professor
Department of Chemistry
C101 Science Complex
The College of New Jersey
Ewing, NJ 08628
Phone: (609) 771-3173
Web: http://bakerj.pages.tcnj.edu/
<https://sites.google.com/site/bakercompchemlab/>
On Thu, Aug 6, 2015 at 3:46 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
> Hi Joe,
>
> I don't have much time to look into this right now, but my suspicion is
> that this is a subtle bug related to small systems. Can you check your
> box dimensions against the cutoff both when you start the simulations
> and when they crash?
>
> Normally the limit on PME simulations is that your shortest box dimension
> must be at least twice the cutoff, to avoid minimum-image issues in the
> PME sum. The GPU code has a slightly larger limit due to the way it builds
> the pairlist. What I suspect is happening is that your system starts off
> okay, but then the box shrinks and at some point one of the dimensions
> becomes too small - I need to check the exact limit, but something close
> to 2 x (cut + skinnb) is probably about right. When the box gets too
> small you get a corrupt pairlist, and this leads to the NaNs. Right now
> there is no check during a run for the box size being too small - only at
> the start of a run - for performance reasons.
>
> It might be possible to add a check every ntpr steps or something similar
> to issue a warning if one is getting close to the limit.
>
> This is a guess right now, but if you can confirm this only happens for
> small systems, it will help isolate it.
>
> All the best
> Ross
>
> > On Aug 5, 2015, at 7:41 PM, Joseph Baker <bakerj.tcnj.edu> wrote:
> >
> > Hi Jason,
> >
> > Thanks. One set of simulations uses the MC barostat; another is
> > constant volume with scaledMD. We see the behavior in both types of
> > simulations. Both also use the Langevin thermostat.
> >
> > I'm planning on doing the validation check, but I assumed that running
> > with the same seed and seeing all of the same energies in the logfile,
> > with the NaN showing up at the same step, was a mini-version of those
> > validation tests (which are just checking energies, from my
> > understanding?). Also, since this happens on several of my GPUs (less
> > than a year old) and also on my colleague's Kepler GPUs at a different
> > institution (also less than a year old), it would seem a large
> > coincidence for all of these hardware components to develop problems
> > simultaneously, I'd think?
> >
> > Is there any reason to believe that the possibility of water molecules
> > getting too close together and causing these problems might happen much
> > more frequently with small box sizes than larger systems?
> >
> > Also, I can confirm that this problem has not been observed in long (100+
> > ns) simulations on CPUs.
> >
> > Thanks,
> > Joe
> >
> >
> > --
> > Joseph Baker, PhD
> > Assistant Professor
> > Department of Chemistry
> > C101 Science Complex
> > The College of New Jersey
> > Ewing, NJ 08628
> > Phone: (609) 771-3173
> > Web: http://bakerj.pages.tcnj.edu/
> > <https://sites.google.com/site/bakercompchemlab/>
> >
> > On Wed, Aug 5, 2015 at 8:10 PM, Jason Swails <jason.swails.gmail.com> wrote:
> >
> >> On Wed, Aug 5, 2015 at 3:00 PM, Joseph Baker <bakerj.tcnj.edu> wrote:
> >>
> >>> Hi Ian,
> >>>
> >>> Thanks for the reply. This appears to happen across several GPU
> >>> types here, and the machines have been rebooted recently (this also
> >>> happened before the reboot). I have never seen this for any of my
> >>> larger systems, just these fairly tiny dipeptide+water box cases.
> >>> Also, a colleague of mine has seen this behavior on NVIDIA Tesla
> >>> K80s. Running systems again with a different seed sometimes gets
> >>> them all the way through to the end without a NaN error, and
> >>> sometimes it does not. Looking a little more closely, the NaNs
> >>> appear to be showing up for a handful of water molecules in the
> >>> simulation (verified by writing out several frames from the nc file
> >>> using cpptraj as rst7 and looking at the coordinates). I am writing
> >>> to a binary nc file, so overly large coordinates shouldn't be the
> >>> problem, from what I understand.
> >>>
> >>
> >> The TIPnP water model does not have any van der Waals terms on the
> >> hydrogens -- it's expected that the oxygen radius is big enough to
> >> shield the hydrogens from a catastrophic collapse.
> >>
> >> But it may happen that occasionally (very rarely) water molecules get
> >> close together, and the electrostatic and van der Waals forces become
> >> large for a couple of interactions (but with different signs).  Since
> >> pmemd.cuda accumulates forces in fixed precision (using an unsigned
> >> long long int), it's possible that there's an overflow leading to a
> >> NaN (particularly if the density is high at that step).
> >>
> >> Are you using the Monte Carlo barostat?  It may be that a proposed
> >> volume change is particularly unfavorable (and should be summarily
> >> rejected), but it's sending the simulation to NaNdyland as an
> >> unfortunate side effect...
> >>
> >> It would also be good to use the validation suite that Ross Walker has
> >> posted on the mailing list before to make sure the GPUs you're using are
> >> still good.
> >>
> >> Hope this helps,
> >> Jason
> >>
> >> --
> >> Jason M. Swails
> >> BioMaPS,
> >> Rutgers University
> >> Postdoctoral Researcher



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber



Received on Thu Aug 06 2015 - 12:30:03 PDT