Re: [AMBER] NaN with pmemd.cuda from Ross Walker on 2015-08-06 (Amber Archive Aug 2015)

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 6 Aug 2015 00:46:39 -0700

Hi Joe,

I don't have much time to look into this right now, but my suspicion is that this is a subtle bug related to small systems. Can you check the size of your box dimensions against the cutoff, both when you start the simulations and when they crash?

Normally the limit on PME simulations is that your shortest box dimension must be at least twice the cutoff to avoid issues with the minimum image convention in the PME sum. The GPU code has a slightly bigger limit due to the way it builds the pairlist. What I suspect is happening is that your system starts off okay but then the box shrinks, and at some point one of the dimensions becomes too small. I need to check the exact limit, but something close to 2 x (cut + skinnb) is probably about right. When the box gets too small you then get a corrupt pairlist, and this leads to the NaNs. Right now, for performance reasons, there is no check for the box being too small during a run, only at the start of a run.
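
As a rough illustration of the limits described above, here is a minimal Python sketch. It takes the suspected GPU limit of 2 x (cut + skinnb) at face value, which is a guess and not a confirmed value. The function names are hypothetical, the box is assumed to be orthorhombic, and the skinnb default of 2.0 Angstroms is an assumption to check against your own input.

```python
def min_box_length_required(cut, skinnb=2.0, gpu=True):
    """Smallest allowed box dimension in Angstroms.

    CPU PME rule: shortest box dimension >= 2 * cut.
    GPU rule (suspected, not confirmed): >= 2 * (cut + skinnb).
    """
    return 2.0 * (cut + skinnb) if gpu else 2.0 * cut


def box_too_small(box, cut, skinnb=2.0, gpu=True):
    """Return True if the shortest box edge is below the limit.

    `box` is a sequence (a, b, c) of orthorhombic box lengths in Angstroms.
    """
    return min(box) < min_box_length_required(cut, skinnb, gpu)


if __name__ == "__main__":
    # Compare the box at the start of a run and near the crash.
    for label, box in [("start", (24.0, 25.0, 26.0)),
                       ("crash", (19.5, 25.0, 26.0))]:
        print(label, "too small on GPU:", box_too_small(box, cut=8.0))
```

With an 8.0 Angstrom cutoff the suspected GPU limit comes out at 20 Angstroms, while the plain PME limit would be 16 Angstroms, so a box that passes the CPU rule can still fail the GPU one.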

It might be possible to add a check every ntpr steps or something similar to issue a warning if one is getting close to the limit.
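
A check like that could look roughly like the following Python sketch. It is only an illustration of the idea, not code from pmemd.cuda. The function name, the `margin` parameter, and the input format (a list of step and box-length pairs) are all hypothetical, and the limit of 2 x (cut + skinnb) is the unconfirmed estimate.

```python
def box_warnings(box_by_step, cut, skinnb=2.0, ntpr=1000, margin=1.0):
    """Collect warnings for steps where the box is close to the limit.

    box_by_step: iterable of (step, (a, b, c)) with orthorhombic box
                 lengths in Angstroms.
    Only steps that are multiples of ntpr are examined, so the check
    costs nothing on the other steps.
    Returns a list of (step, shortest_dimension) tuples.
    """
    limit = 2.0 * (cut + skinnb)  # assumed GPU limit, not confirmed
    warnings = []
    for step, box in box_by_step:
        if step % ntpr != 0:
            continue  # skip steps between print intervals
        shortest = min(box)
        if shortest < limit + margin:
            warnings.append((step, shortest))
    return warnings


if __name__ == "__main__":
    # Made-up box lengths for a slowly shrinking box.
    trajectory = [
        (0, (24.0, 24.0, 24.0)),
        (500, (20.5, 20.5, 20.5)),   # not a multiple of ntpr, skipped
        (1000, (22.0, 22.0, 22.0)),
        (2000, (20.8, 20.8, 20.8)),
        (3000, (19.9, 19.9, 19.9)),
    ]
    for step, shortest in box_warnings(trajectory, cut=8.0):
        print(f"step {step}: shortest box dimension {shortest} A is near the limit")
```

The margin lets the warning fire before the limit is actually crossed, which is the point of checking during the run instead of only at the start.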

This is a guess right now but if you can confirm this only happens for small systems it will help isolate it.

All the best
Ross

> On Aug 5, 2015, at 7:41 PM, Joseph Baker <bakerj.tcnj.edu> wrote:
>
> Hi Jason,
>
> Thanks. One set of simulations is with MC barostat, another is constant
> volume with scaled MD. We see the behavior in both types of simulations.
> Both types also use Langevin thermostat.
>
> I'm planning on doing the validation check, but I assumed that running with
> the same seed and seeing all of the same energies in the logfile and the
> NaN showing up at the same step was a mini-version of doing those
> validation tests (which are just checking energies from my understanding?).
> Also, since this happens on several of my GPUs (less than a year old) and
> also my colleague's Kepler GPUs at a different institution (also less than
> a year old), it would seem to be a large coincidence for this to be
> simultaneous problems on all of these hardware components I'd think?
>
> Is there any reason to believe that the possibility of water molecules
> getting too close together and causing these problems might happen much
> more frequently with small box sizes than larger systems?
>
> Also, I can confirm that this problem has not been observed in long (100+
> ns) simulations on CPUs.
>
> Thanks,
> Joe
>
>
> --
> Joseph Baker, PhD
> Assistant Professor
> Department of Chemistry
> C101 Science Complex
> The College of New Jersey
> Ewing, NJ 08628
> Phone: (609) 771-3173
> Web: http://bakerj.pages.tcnj.edu/
> <https://sites.google.com/site/bakercompchemlab/>
>
> On Wed, Aug 5, 2015 at 8:10 PM, Jason Swails <jason.swails.gmail.com> wrote:
>
>> On Wed, Aug 5, 2015 at 3:00 PM, Joseph Baker <bakerj.tcnj.edu> wrote:
>>
>>> Hi Ian,
>>>
>>> Thanks for the reply. This appears to happen across several GPU types here,
>>> and the machines have been rebooted recently (this also happened before the
>>> reboot). I have never seen this for any of my larger systems, just these
>>> fairly tiny dipeptide+water box cases. Also, a colleague of mine has seen
>>> this behavior on NVIDIA Tesla K80s. Running systems again with a different
>>> seed sometimes gets them all the way through to the end without a NaN
>>> error, and sometimes it does not. Looking a little more closely, the NaNs
>>> appear to be showing up for a handful of water molecules in the simulation
>>> (verified by writing out several frames from the nc file using cpptraj as
>>> rst7 and looking at the coordinates). I am writing to a binary nc file, so
>>> overly large coordinates shouldn't be the problem, from what I understand.
>>>
>>
>> The TIPnP water model does not have any van der Waals terms on the
>> hydrogens -- it's expected that the oxygen radius is big enough to shield
>> the hydrogens from a catastrophic collapse.
>>
>> But it may happen that occasionally (very rarely) water molecules get close
>> together, and the electrostatic and van der Waals forces become large for a
>> couple interactions (but with different signs). Since pmemd.cuda
>> accumulates forces in fixed precision (using an unsigned long long int),
>> it's possible that there's an overflow leading to a NaN (particularly if
>> the density is high at that step).
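
To illustrate the overflow scenario described in the quoted paragraph above, here is a small Python sketch of fixed-point force accumulation in a 64-bit integer. It is a toy model, not the pmemd.cuda implementation. The scale factor of 2^40 and all function names are assumptions for illustration, and the wraparound is modelled as signed two's complement arithmetic.

```python
SCALE = 1 << 40          # assumed fixed-point scale; the real value may differ
MASK = (1 << 64) - 1     # keep only the low 64 bits


def to_fixed(force):
    """Convert a floating-point force to a scaled integer."""
    return int(round(force * SCALE))


def wrap_int64(value):
    """Emulate 64-bit two's complement wraparound on a Python int."""
    value &= MASK
    return value - (1 << 64) if value >= (1 << 63) else value


def accumulate(forces):
    """Sum forces in 64-bit fixed point and convert back to float."""
    total = 0
    for f in forces:
        total = wrap_int64(total + to_fixed(f))
    return total / SCALE


if __name__ == "__main__":
    # Largest magnitude representable with this scale: 2**63 / 2**40 = 2**23.
    print("max representable:", (1 << 63) / SCALE)
    print("normal sum:", accumulate([1.5, -0.5]))
    # Two large same-sign contributions exceed the range and wrap around.
    print("overflowed sum:", accumulate([6.0e6, 6.0e6]))
```

In this toy model the overflowed sum comes back with the wrong sign and magnitude instead of as a NaN. Whether and how such a bad force turns into a NaN in the real code is not something this sketch can show.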
>>
>> Are you using the Monte Carlo barostat? It may be that a proposed volume
>> change is particularly unfavorable (and should be summarily rejected), but
>> it's sending the simulation to NaNdyland as an unfortunate side effect...
>>
>> It would also be good to use the validation suite that Ross Walker has
>> posted on the mailing list before to make sure the GPUs you're using are
>> still good.
>>
>> Hope this helps,
>> Jason
>>
>> --
>> Jason M. Swails
>> BioMaPS,
>> Rutgers University
>> Postdoctoral Researcher
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber
>>

Received on Thu Aug 06 2015 - 01:00:02 PDT