Re: [AMBER] NaN with pmemd.cuda

From: Joseph Baker <bakerj.tcnj.edu> Date: Thu, 6 Aug 2015 14:59:32 -0400

--
Joseph Baker, PhD
Assistant Professor
Department of Chemistry
C101 Science Complex
The College of New Jersey
Ewing, NJ 08628
Phone: (609) 771-3173
Web: http://bakerj.pages.tcnj.edu/
<https://sites.google.com/site/bakercompchemlab/>
On Thu, Aug 6, 2015 at 3:46 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
> Hi Joe,
>
> I don't have much time to look into this right now but my suspicion is
> that this is a subtle bug related to small systems. Can you check the size
> of your box dimensions vs the cut off both when you start simulations and
> when they crash.
>
> Normally the limit on PME simulations is that your shortest box dimension
> must be at least twice the cutoff to avoid issues with minimum image in the
> PME sum. The GPU code has a slightly bigger limit due to the way it builds
> the pairlist. What I suspect is happening is that your system starts off
> okay but then the box size shrinks and at some point one of the dimensions
> is too small - I need to check the exact limit - but something close to 2 x
> (cut + skinnb) is probably about right. When the box gets too small you are
> then getting a corrupt pair list and this is leading to the NANs. Right now
> there is no check during a run for the box size being too small - only at
> the start of a run - for performance reasons.
>
> It might be possible to add a check every ntpr steps or something similar
> to issue a warning if one is getting close to the limit.
>
> This is a guess right now but if you can confirm this only happens for
> small systems it will help isolate it.
>
> All the best
> Ross
>
> > On Aug 5, 2015, at 7:41 PM, Joseph Baker <bakerj.tcnj.edu> wrote:
> >
> > Hi Jason,
> >
> > Thanks. One set of simulations is with MC barostat, another is constant
> > volume with scaled MD. We see the behavior in both types of simulations.
> > Both types also use Langevin thermostat.
> >
> > I'm planning on doing the validation check, but I assumed that running
> with
> > the same seed and seeing all of the same energies in the logfile and the
> > NaN showing up at the same step was a mini-version of doing those
> > validation tests (which are just checking energies from my
> understanding?).
> > Also, since this happens on several of my GPUs (less than a year old) and
> > also my colleague's Kepler GPUs at a different institution (also less
> than
> > a year old), it would seem to be a large coincidence for this to be
> > simultaneous problems on all of these hardware components I'd think?
> >
> > Is there any reason to believe that the possibility of water molecules
> > getting too close together and causing these problems might happen much
> > more frequently with small box sizes than larger systems?
> >
> > Also, I can confirm that this problem has not been observed in long (100+
> > ns) simulations on CPUs.
> >
> > Thanks,
> > Joe
> >
> >
> > --
> > Joseph Baker, PhD
> > Assistant Professor
> > Department of Chemistry
> > C101 Science Complex
> > The College of New Jersey
> > Ewing, NJ 08628
> > Phone: (609) 771-3173
> > Web: http://bakerj.pages.tcnj.edu/
> > <https://sites.google.com/site/bakercompchemlab/>
> >
> > On Wed, Aug 5, 2015 at 8:10 PM, Jason Swails <jason.swails.gmail.com>
> wrote:
> >
> >> On Wed, Aug 5, 2015 at 3:00 PM, Joseph Baker <bakerj.tcnj.edu> wrote:
> >>
> >>> Hi Ian,
> >>>
> >>> Thanks for the reply. This appears to happen across several GPU types
> >> here,
> >>> and the machines have been rebooted recently (this also happened before
> >> the
> >>> reboot). I have never seen this for any of my larger systems, just
> these
> >>> fairly tiny dipeptide+water box cases. Also, a colleague of mine has
> seen
> >>> this behavior on NVidia Tesla K80s. Running systems again with a
> >> different
> >>> seed sometimes gets them all the way through to the end without an NaN
> >>> error, and sometimes it does not. Looking a little more closely, the
> >> NaN's
> >>> appear to be showing up for a handful of water molecules in the
> >> simulation
> >>> (verified by writing out several frames from the nc file using cpptraj
> as
> >>> rst7 and looking at the coordinates). I am writing to binary nc file,
> so
> >>> too large coordinates shouldn't be the problem from what I understand.
> >>>
> >>
> >> The TIPnP water model does not have any van der Waals terms on the
> >> hydrogens -- it's expected that the oxygen radius is big enough to
> shield
> >> the hydrogens from a catastrophic collapse.
> >>
> >> But it may happen that occasionally (very rarely) water molecules get
> close
> >> together, and the electrostatic and van der Waals forces become large
> for a
> >> couple interactions (but with different signs).  Since pmemd.cuda
> >> accumulates forces in fixed precision (using an unsigned long long int),
> >> it's possible that there's an overflow leading to a NaN (particularly if
> >> the density is high at that step).
> >>
> >> Are you using the Monte Carlo barostat?  It may be that a proposed
> volume
> >> change is particularly unfavorable (and should be summarily rejected),
> but
> >> it's sending the simulation to NaNdyland as an unfortunate side
> effect...
> >>
> >> It would also be good to use the validation suite that Ross Walker has
> >> posted on the mailing list before to make sure the GPUs you're using are
> >> still good.
> >>
> >> Hope this helps,
> >> Jason
> >>
> >> --
> >> Jason M. Swails
> >> BioMaPS,
> >> Rutgers University
> >> Postdoctoral Researcher
> >> _______________________________________________
> >> AMBER mailing list
> >> AMBER.ambermd.org
> >> http://lists.ambermd.org/mailman/listinfo/amber
> >>
> > _______________________________________________
> > AMBER mailing list
> > AMBER.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber
>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>