Re: [AMBER] NaN with pmemd.cuda

From: Scott Le Grand <varelse2005.gmail.com>
Date: Fri, 7 Aug 2015 01:26:10 -0700

Try reducing the time step a little bit. We have seen something similar,
and it was caused by overwhelming the fast SHAKE routine for waters. It
could occasionally reproduce on CPUs, but not as easily or as often. Also
try running in DPFP to see if it still happens.
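
For anyone who wants to try both suggestions in one go, a rough Python
sketch along these lines would halve dt in the input file and relaunch with
the double-precision GPU binary (md.in, system.prmtop, and system.rst7 are
placeholder names, and a literal "dt = 0.002" entry is assumed):

    import re
    import subprocess

    # Halve the time step in the &cntrl namelist (assumes "dt = 0.002" is present).
    with open("md.in") as f:
        mdin = f.read()
    mdin = re.sub(r"dt\s*=\s*0\.002", "dt = 0.001", mdin)
    with open("md_smalldt.in", "w") as f:
        f.write(mdin)

    # Rerun with the double-precision (DPFP) GPU binary.
    subprocess.run(["pmemd.cuda_DPFP", "-O",
                    "-i", "md_smalldt.in", "-o", "md_smalldt.out",
                    "-p", "system.prmtop", "-c", "system.rst7",
                    "-r", "md_smalldt.rst7", "-x", "md_smalldt.nc"],
                   check=True)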

On Friday, August 7, 2015, Joseph Baker <bakerj.tcnj.edu> wrote:

> Hi Ross,
>
> So I should also mention that the simulation still runs to the end and
> finishes, even though there are NaN and *** entries for some energy terms
> (and they show up in the coordinates directly if you try to output a frame
> as rst7 for example).
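
A quick way to pin down where the blow-up starts is to scan the mdout for
the first NaN or ****** energy entry; a minimal Python sketch (assuming the
usual "NSTEP =" energy records) could look like this:

    import re
    import sys

    # Usage: python find_first_nan.py md.out   (hypothetical script name)
    last_step = None
    with open(sys.argv[1]) as f:
        for line in f:
            m = re.search(r"NSTEP\s*=\s*(\d+)", line)
            if m:
                last_step = int(m.group(1))
            if "NaN" in line or "*****" in line:
                print("first bad energy line (near step %s):" % last_step)
                print(line.rstrip())
                break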
>
> The NVT simulation is using scaledMD, and its box dimensions stay steady
> at 26.842 x 26.842 x 26.842 Å. The cutoff is 8 Å.
>
> The NPT simulation is just standard unbiased MD, and it uses the Monte
> Carlo barostat. I've attached the boxinfo.dat file (sampled every 2 ps
> over the 90 seconds of sim, up to the end of the simulation chunk in which
> the NaN was observed). You can see that the box lengths all stay large
> enough that the 8 Å cutoff should not be a problem.
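
As a rough cross-check of that claim, a short Python sketch like the one
below (assuming boxinfo.dat carries the three box lengths in the first three
columns of each line) would report the smallest box edge seen and compare it
against twice the 8 Å cutoff:

    cut = 8.0                          # direct-space cutoff in Angstroms
    min_edge = float("inf")
    with open("boxinfo.dat") as f:
        for line in f:
            fields = line.split()
            if len(fields) < 3:
                continue
            try:
                a, b, c = (float(x) for x in fields[:3])
            except ValueError:
                continue               # skip header or comment lines
            min_edge = min(min_edge, a, b, c)

    print("smallest box edge: %.3f A" % min_edge)
    print("2 x cutoff       : %.3f A" % (2 * cut))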
>
> We are currently running boxes with 100, 400, 800, 1200, 1600, 2000, and
> 3000 waters to see if there is some correlation between NaN frequency and
> the number of waters/box size. We'll hopefully have data on that soon as
> well.
>
> Not a problem in terms of when it can be looked into.
>
> Joe
>
>
> --
> Joseph Baker, PhD
> Assistant Professor
> Department of Chemistry
> C101 Science Complex
> The College of New Jersey
> Ewing, NJ 08628
> Phone: (609) 771-3173
> Web: http://bakerj.pages.tcnj.edu/
> <https://sites.google.com/site/bakercompchemlab/>
>
> On Thu, Aug 6, 2015 at 3:46 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
>
> > Hi Joe,
> >
> > I don't have much time to look into this right now, but my suspicion is
> > that this is a subtle bug related to small systems. Can you check the
> > size of your box dimensions vs the cutoff, both when you start the
> > simulations and when they crash?
> >
> > Normally the limit on PME simulations is that your shortest box dimension
> > must be at least twice the cutoff, to avoid issues with the minimum image
> > convention in the PME sum. The GPU code has a slightly bigger limit due
> > to the way it builds the pairlist. What I suspect is happening is that
> > your system starts off okay, but then the box shrinks and at some point
> > one of the dimensions is too small - I need to check the exact limit, but
> > something close to 2 x (cut + skinnb) is probably about right. When the
> > box gets too small you then get a corrupt pairlist, and this is leading
> > to the NaNs. Right now there is no check during a run for the box size
> > being too small - only at the start of a run - for performance reasons.
> >
> > It might be possible to add a check every ntpr steps or something similar
> > to issue a warning if one is getting close to the limit.
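
Until such a check exists inside the code, something like the sketch below
could be run against box dimensions dumped during a simulation; the skinnb
value of 2.0 Å is a guessed default, and 2 x (cut + skinnb) is only the
approximate limit described above, not a documented criterion:

    def box_too_small(box, cut=8.0, skinnb=2.0):
        """Flag a box whose shortest edge falls below ~2 * (cut + skinnb)."""
        limit = 2.0 * (cut + skinnb)
        return min(box) < limit, limit

    # Example: the NVT box quoted earlier in this thread.
    too_small, limit = box_too_small((26.842, 26.842, 26.842))
    print("limit = %.1f A, too small? %s" % (limit, too_small))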
> >
> > This is a guess right now but if you can confirm this only happens for
> > small systems it will help isolate it.
> >
> > All the best
> > Ross
> >
> > > On Aug 5, 2015, at 7:41 PM, Joseph Baker <bakerj.tcnj.edu> wrote:
> > >
> > > Hi Jason,
> > >
> > > Thanks. One set of simulations is with MC barostat, another is constant
> > > volume with scaled MD. We see the behavior in both types of simulations.
> > > Both types also use Langevin thermostat.
> > >
> > > I'm planning on doing the validation check, but I assumed that running
> > > with the same seed, seeing all of the same energies in the logfile, and
> > > seeing the NaN show up at the same step was a mini-version of doing
> > > those validation tests (which are just checking energies, from my
> > > understanding?). Also, since this happens on several of my GPUs (less
> > > than a year old) and also on my colleague's Kepler GPUs at a different
> > > institution (also less than a year old), it would seem to be a large
> > > coincidence for this to be a simultaneous problem on all of these
> > > hardware components, I'd think?
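
For what it's worth, that same-seed comparison can be made a bit more
systematic; a small Python sketch (assuming the usual "Etot   =" lines in
mdout, with run1.out and run2.out as placeholder names) would report the
first energy record where two runs diverge:

    import re

    def etot_series(path):
        """Collect the Etot fields from an mdout file, as printed strings."""
        pat = re.compile(r"Etot\s*=\s*(\S+)")
        values = []
        with open(path) as f:
            for line in f:
                m = pat.search(line)
                if m:
                    values.append(m.group(1))
        return values

    run1 = etot_series("run1.out")
    run2 = etot_series("run2.out")
    for i, (a, b) in enumerate(zip(run1, run2)):
        if a != b:
            print("first divergence at energy record %d: %s vs %s" % (i, a, b))
            break
    else:
        print("all %d common energy records match" % min(len(run1), len(run2)))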
> > >
> > > Is there any reason to believe that the possibility of water molecules
> > > getting too close together and causing these problems might happen much
> > > more frequently with small boxes than with larger systems?
> > >
> > > Also, I can confirm that this problem has not been observed in long
> > > (100+ ns) simulations on CPUs.
> > >
> > > Thanks,
> > > Joe
> > >
> > >
> > > --
> > > Joseph Baker, PhD
> > > Assistant Professor
> > > Department of Chemistry
> > > C101 Science Complex
> > > The College of New Jersey
> > > Ewing, NJ 08628
> > > Phone: (609) 771-3173
> > > Web: http://bakerj.pages.tcnj.edu/
> > > <https://sites.google.com/site/bakercompchemlab/>
> > >
> > > On Wed, Aug 5, 2015 at 8:10 PM, Jason Swails <jason.swails.gmail.com> wrote:
> > >
> > >> On Wed, Aug 5, 2015 at 3:00 PM, Joseph Baker <bakerj.tcnj.edu> wrote:
> > >>
> > >>> Hi Ian,
> > >>>
> > >>> Thanks for the reply. This appears to happen across several GPU
> > >>> types here, and the machines have been rebooted recently (this also
> > >>> happened before the reboot). I have never seen this for any of my
> > >>> larger systems, just these fairly tiny dipeptide + water box cases.
> > >>> Also, a colleague of mine has seen this behavior on NVIDIA Tesla
> > >>> K80s. Running systems again with a different seed sometimes gets them
> > >>> all the way through to the end without a NaN error, and sometimes it
> > >>> does not. Looking a little more closely, the NaNs appear to be
> > >>> showing up for a handful of water molecules in the simulation
> > >>> (verified by writing out several frames from the nc file as rst7
> > >>> using cpptraj and looking at the coordinates). I am writing to a
> > >>> binary nc file, so too-large coordinates shouldn't be the problem,
> > >>> from what I understand.
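
Checking those extracted frames by eye can be tedious; a rough Python sketch
like the one below (assuming an ASCII rst7 with a title line, an atom-count
line, and then whitespace-separated coordinate fields) flags NaN or starred
entries automatically:

    import math

    def bad_fields(path):
        """Return (line number, field) pairs that are NaN or unparsable."""
        bad = []
        with open(path) as f:
            lines = f.readlines()
        for lineno, line in enumerate(lines[2:], start=3):  # skip title + natom
            for field in line.split():
                try:
                    if math.isnan(float(field)):
                        bad.append((lineno, field))
                except ValueError:                          # e.g. '********'
                    bad.append((lineno, field))
        return bad

    print(bad_fields("frame.rst7"))                         # placeholder name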
> > >>>
> > >>
> > >> The TIPnP water model does not have any van der Waals terms on the
> > >> hydrogens -- it's expected that the oxygen radius is big enough to
> > >> shield the hydrogens from a catastrophic collapse.
> > >>
> > >> But it may happen that occasionally (very rarely) water molecules get
> > >> close together, and the electrostatic and van der Waals forces become
> > >> large for a couple interactions (but with different signs). Since
> > >> pmemd.cuda accumulates forces in fixed precision (using an unsigned
> > >> long long int), it's possible that there's an overflow leading to a
> > >> NaN (particularly if the density is high at that step).
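
The overflow scenario described here can be illustrated with a toy model of
fixed-point force accumulation; the 2**40 scale factor and 64-bit wraparound
below are illustrative choices, not necessarily the exact constants that
pmemd.cuda uses:

    SCALE = 2 ** 40                # fixed-point scale factor (illustrative)
    MASK = 2 ** 64 - 1             # emulate unsigned 64-bit wraparound

    def accumulate(forces):
        """Sum force components in emulated 64-bit fixed precision."""
        acc = 0
        for f in forces:
            acc = (acc + int(f * SCALE)) & MASK
        if acc >= 2 ** 63:         # reinterpret the accumulator as signed
            acc -= 2 ** 64
        return acc / SCALE

    print(accumulate([1.0, -1.0, 0.5]))    # well-behaved sum: 0.5
    print(accumulate([1.0e7, 1.0e7]))      # clashing waters: wraps, not 2.0e7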
> > >>
> > >> Are you using the Monte Carlo barostat? It may be that a proposed
> > >> volume change is particularly unfavorable (and should be summarily
> > >> rejected), but it's sending the simulation to NaNdyland as an
> > >> unfortunate side effect...
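
For reference, a textbook Metropolis test for an NPT volume move looks
roughly like the sketch below (standard formula only, not necessarily line
for line what the pmemd barostat does); a hugely unfavorable move has an
acceptance probability of essentially zero, which is why it should normally
be harmless:

    import math
    import random

    def accept_volume_move(dU, dV, V_old, V_new, n_mol, T, P):
        """Metropolis test for an isothermal-isobaric volume move.

        dU in kcal/mol, volumes in A^3, P in atm, T in K, n_mol molecules.
        """
        kT = 0.0019872041 * T              # Boltzmann constant, kcal/mol/K
        p_conv = P * 1.4584e-5             # atm * A^3 -> kcal/mol
        arg = -(dU + p_conv * dV) / kT + n_mol * math.log(V_new / V_old)
        return arg > 0 or random.random() < math.exp(arg)

    # Illustrative numbers only: a wildly unfavorable move (huge positive dU)
    # for a ~26.8 A cubic water box is essentially always rejected.
    print(accept_volume_move(dU=1.0e4, dV=-50.0, V_old=19340.0, V_new=19290.0,
                             n_mol=600, T=300.0, P=1.0))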
> > >>
> > >> It would also be good to use the validation suite that Ross Walker
> > >> has posted on the mailing list before to make sure the GPUs you're
> > >> using are still good.
> > >>
> > >> Hope this helps,
> > >> Jason
> > >>
> > >> --
> > >> Jason M. Swails
> > >> BioMaPS,
> > >> Rutgers University
> > >> Postdoctoral Researcher
> > >> _______________________________________________
> > >> AMBER mailing list
> > >> AMBER.ambermd.org
> > >> http://lists.ambermd.org/mailman/listinfo/amber
> > >>
> > > _______________________________________________
> > > AMBER mailing list
> > > AMBER.ambermd.org
> > > http://lists.ambermd.org/mailman/listinfo/amber
> >
> >
> > _______________________________________________
> > AMBER mailing list
> > AMBER.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber
> >
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Aug 07 2015 - 01:30:02 PDT