Re: [AMBER] Etot and TEMP become NaN during a simulation on commodity GPUs

From: Ross Walker <ross.rosswalker.co.uk>
Date: Sat, 16 Sep 2017 10:00:21 -0400

Hi Chris,

Good to hear it worked the second time. I can relax, knowing it isn't a bug in the code. I would say that such glitches should almost always lead to NaNs or some kind of GPU upload/download failure, though I don't have enough data to say for sure. In my opinion, though, if an undetectable glitch occurred and led to a perturbation that resulted in you making an incorrect scientific conclusion, then you designed your experiment incorrectly. Nobody should ever draw scientific conclusions from a single sampling of a rare event, so I've never been concerned about such things. Now, if you want to use a GPU to run the autopilot of your passenger plane, it might be a different story, although even then the system should be designed with sufficient redundancy that a single failure does not cascade. ;-)

All the best
Ross

> On Sep 15, 2017, at 2:09 PM, Chris Neale <candrewn.gmail.com> wrote:
>
> Dear Ross:
>
> I guess it was a GPU glitch (or something similar, see below). Is there any
> evidence that these kinds of errors will always lead to NaN's rather than
> some other perturbation that is less obvious?
>
> I re-ran the simulation as suggested and it did not develop NaN values at
> the same timestep as the previous run, though it did have the same dynamics
> up to that point (as observed from energies at e.g. the 31000000th
> timestep, see below).
>
> ###############################################################
> ########### The previous run that had NaN's:
>
>
> NSTEP = 31000000 TIME(PS) = 2205999.989 TEMP(K) = 311.05 PRESS = 0.0
> Etot = -240282.7050 EKtot = 104614.0859 EPtot = -344896.7909
> BOND = 7570.3922 ANGLE = 26289.0463 DIHED = 23693.8946
> UB = 9872.5247 IMP = 331.2119 CMAP = -183.0812
> 1-4 NB = 3607.1803 1-4 EEL = -35542.7377 VDWAALS = 13268.2434
> EELEC = -393803.4653 EHBOND = 0.0000 RESTRAINT = 0.0000
> EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 1546525.2111
> SURFTEN = 0.0000
> Density = 1.0097
> ------------------------------------------------------------------------------
>
>
> NSTEP = 31250000 TIME(PS) = 2206999.989 TEMP(K) = NaN PRESS = 0.0
> Etot = NaN EKtot = NaN EPtot = -344391.5063
> BOND = 7571.4110 ANGLE = 26315.8830 DIHED = 23781.7767
> UB = 9897.5359 IMP = 329.2852 CMAP = -170.7033
> 1-4 NB = 3577.8462 1-4 EEL = -35569.5495 VDWAALS = 12988.7025
> EELEC = -393113.6939 EHBOND = 0.0000 RESTRAINT = 0.0000
> EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 1545119.9333
> SURFTEN = 0.0000
> Density = 1.0106
> ------------------------------------------------------------------------------
>
>
> ###############################################################
> ########### Repeat run:
>
>
> NSTEP = 31000000 TIME(PS) = 2205999.989 TEMP(K) = 311.05 PRESS = 0.0
> Etot = -240282.7050 EKtot = 104614.0859 EPtot = -344896.7909
> BOND = 7570.3922 ANGLE = 26289.0463 DIHED = 23693.8946
> UB = 9872.5247 IMP = 331.2119 CMAP = -183.0812
> 1-4 NB = 3607.1803 1-4 EEL = -35542.7377 VDWAALS = 13268.2434
> EELEC = -393803.4653 EHBOND = 0.0000 RESTRAINT = 0.0000
> EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 1546525.2111
> SURFTEN = 0.0000
> Density = 1.0097
> ------------------------------------------------------------------------------
>
>
> NSTEP = 31250000 TIME(PS) = 2206999.989 TEMP(K) = 310.08 PRESS = 0.0
> Etot = -239797.5416 EKtot = 104288.2344 EPtot = -344085.7760
> BOND = 7579.4264 ANGLE = 26596.7649 DIHED = 23685.6631
> UB = 10046.4229 IMP = 337.2731 CMAP = -157.9530
> 1-4 NB = 3579.5379 1-4 EEL = -35600.4566 VDWAALS = 13288.6101
> EELEC = -393441.0649 EHBOND = 0.0000 RESTRAINT = 0.0000
> EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 1544960.3824
> SURFTEN = 0.0000
> Density = 1.0107
> ------------------------------------------------------------------------------
>
>
>
> Thank you,
> Chris.
>
> On Tue, Sep 12, 2017 at 11:16 AM, Chris Neale <candrewn.gmail.com> wrote:
>
>> Dear Ross:
>>
>> Thanks for the tips. The system has 158,000 atoms. I will run it from the
>> previous segment with the same random seed and with a different random seed
>> and let you know what the results are. The note below (from my previous run
>> that gave NaN's) has me concerned as to whether the "disabling the
>> synchronization of random numbers" will affect the reproducibility, but
>> perhaps this is not an issue:
>>
>> Note: ig = -1. Setting random seed to 875535 based on wallclock time in
>> microseconds and disabling the synchronization of random numbers
>> between tasks to improve performance.
>>
>> I don't mind the occasional glitch on a GPU, but I'm also happy to help
>> track things down if that is useful for you.
>>
>> Thank you,
>> Chris.
>>
>>
>> On Tue, Sep 12, 2017 at 10:43 AM, Ross Walker <ross.rosswalker.co.uk>
>> wrote:
>>
>>> Hi Chris,
>>>
>>> How many atoms are in your simulation? This issue, where a water shoots
>>> off to infinity, can happen if your box is too small for the GPU code.
>>> Because of the way the list is built, if any edge of the box is less than
>>> 2 x (cutoff + skinnb) at any point during the simulation, the list build
>>> fails, and the net result is often an atom that ends up with NaNs as its
>>> coordinates; in the case of a water with SHAKE, it drags the other atoms
>>> with it. If you want to be sure this is the issue and not an ECC error,
>>> go back to the restart file from before the NaNs occurred and run again
>>> with the same random seed and hardware; you should get an identical
>>> (bad) restart file.
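As a rough illustration of the box-size condition described above, the following Python sketch flags any box edge shorter than 2 x (cut + skinnb). The function names are hypothetical, and the skinnb default of 2.0 Angstroms is an assumption here; check your own input and the AMBER manual before relying on it.

```python
def min_safe_edge(cut, skinnb):
    """Smallest box edge (Angstroms) allowed by the 2 x (cut + skinnb) rule."""
    return 2.0 * (cut + skinnb)


def edges_too_small(box_edges, cut, skinnb=2.0):
    """Return the box edges (Angstroms) that fall below the limit.

    skinnb=2.0 is an assumed default, not a value taken from this thread.
    """
    limit = min_safe_edge(cut, skinnb)
    return [edge for edge in box_edges if edge < limit]


if __name__ == "__main__":
    # Hypothetical box dimensions, for illustration only.
    print(edges_too_small([27.5, 80.0, 110.0], cut=12.0))
```

Under these assumptions, cut=12.0 gives a limit of 28 Angstroms per edge. In a constant-pressure run the box fluctuates, so the check would need to hold for every frame, not just the starting structure.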
>>>
>>> Unfortunately, this is expensive to check for on every step, and GPUs do
>>> not handle NaNs the same way CPUs do, so it is only checked for when the
>>> restart file is loaded to continue a run. At that point all you can do is
>>> discard the previous run and, if you are very close to the limit, either
>>> run again with a different random seed and hope to be lucky, switch to
>>> NVT, reduce cut or skinnb, or add more water to make your system bigger.
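Because the NaN check only happens when a restart file is loaded, one could also inspect the PDB text written by ambpdb before continuing a run. This is a minimal sketch, assuming standard fixed-column PDB records with coordinates in columns 31-54; the function name is hypothetical and is not part of AMBER.

```python
def atoms_with_nan(pdb_text):
    """Return the serial-number fields of ATOM/HETATM records whose
    coordinate columns (31-54 in the PDB format) contain 'nan'."""
    bad = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            coords = line[30:54]  # x, y, z occupy three 8-character fields
            if "nan" in coords.lower():
                bad.append(line[6:11].strip())
    return bad


if __name__ == "__main__":
    import sys

    # Usage (assumed): ambpdb -p this.prmtop -c v0.5_5.rst | python check_nan.py
    serials = atoms_with_nan(sys.stdin.read())
    print("atoms with NaN coordinates:", serials)
```

The serial numbers are returned as strings because the column can hold non-decimal values in very large systems.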
>>>
>>> That said, the pdb output suggests at least 54K atoms, so this is a big
>>> system. So it could just be a random glitch in the GPU; I'm not sure ECC
>>> would catch it, since it doesn't protect the whole computation path. Try
>>> repeating the simulation identically and see if the error occurs in the
>>> same place. If it does, then this may be a bug in the code that we
>>> haven't come across before.
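The repeat-run test suggested above comes down to comparing the energy records of the two runs step by step. The sketch below is one hedged way to do that in Python; the function names are hypothetical, and it assumes the output contains "NSTEP = ..." and "Etot = ..." fields as in the records quoted in this thread.

```python
import re

# Pairs each "NSTEP = <int>" with the next "Etot = <value>" that follows it.
RECORD_RE = re.compile(r"NSTEP\s*=\s*(\d+).*?Etot\s*=\s*(\S+)", re.DOTALL)


def etot_by_step(mdout_text):
    """Map NSTEP -> Etot, keeping Etot as the printed string (e.g. 'NaN')."""
    return {int(step): etot for step, etot in RECORD_RE.findall(mdout_text)}


def first_divergence(run_a, run_b):
    """Return the first shared NSTEP at which the printed Etot differs,
    or None if the two runs agree at every shared step."""
    a, b = etot_by_step(run_a), etot_by_step(run_b)
    for step in sorted(set(a) & set(b)):
        if a[step] != b[step]:
            return step
    return None
```

Comparing the printed strings rather than floats keeps the check strict: identical runs should print identical digits, and a NaN in one run will never match a number in the other. If the output file also contains summary sections that repeat an NSTEP value, the later record overwrites the earlier one in this sketch.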
>>>
>>> All the best
>>> Ross
>>>
>>>> On Sep 12, 2017, at 12:28 PM, Chris Neale <candrewn.gmail.com> wrote:
>>>>
>>>> Dear users:
>>>>
>>>> Data corruption seems final, so I am backing up to a previous simulation
>>>> segment, but I thought I'd report this in case it is useful to anybody.
>>>>
>>>> I am running Amber 16 pmemd on 2 GPUs, using a charmm force field with a
>>>> topology built in gromacs and then ported to amber with parmed. I've run
>>>> hundreds of microseconds without seeing this type of issue, so my guess is
>>>> that it is not specific to the system or the forcefield. At the moment, I'm
>>>> suspecting that it is one of those rare things that would have been caught
>>>> by ECC if I was using a GPU that supported it (which I am not).
>>>>
>>>> During an attempt to restart my simulation, pmemd gives the error:
>>>>
>>>> | ERROR: NaN(s) found in input coordinates.
>>>> This likely means that something went wrong in the previous
>>>> simulation.
>>>>
>>>>
>>>> And the command:
>>>> ambpdb -p this.prmtop -c v0.5_5.rst
>>>>
>>>> produces output that contains this obvious problem:
>>>>
>>>> ATOM 54207 HW1 SOL 3056 5.239 35.425 112.718 1.00 0.00 H
>>>> ATOM 54208 HW2 SOL 3056 5.923 36.752 112.464 1.00 0.00 H
>>>> ATOM 54209 OW SOL 3057 78.934 30.200 23.018 1.00 0.00 O
>>>> ATOM 54210 HW1 SOL 3057 78.261 30.829 23.279 1.00 0.00 H
>>>> ATOM 54211 HW2 SOL 3057 79.017 30.317 22.071 1.00 0.00 H
>>>> ATOM 54212 OW SOL 3058 -nan -nan -nan 1.00 0.00 O
>>>> ATOM 54213 HW1 SOL 3058 -nan -nan -nan 1.00 0.00 H
>>>> ATOM 54214 HW2 SOL 3058 -nan -nan -nan 1.00 0.00 H
>>>> ATOM 54215 OW SOL 3059 49.273 109.879 40.039 1.00 0.00 O
>>>> ATOM 54216 HW1 SOL 3059 48.566 109.877 39.394 1.00 0.00 H
>>>> ATOM 54217 HW2 SOL 3059 50.056 109.653 39.537 1.00 0.00 H
>>>> ATOM 54218 OW SOL 3060 52.061 48.796 41.712 1.00 0.00 O
>>>> ATOM 54219 HW1 SOL 3060 51.608 49.476 41.214 1.00 0.00 H
>>>> ATOM 54220 HW2 SOL 3060 52.978 49.072 41.715 1.00 0.00 H
>>>>
>>>> Looking back at the previous segment of simulation, I can see where the
>>>> Etot term popped from a real number to NaN:
>>>>
>>>> NSTEP = 30750000 TIME(PS) = 2204999.989 TEMP(K) = 309.78 PRESS = 0.0
>>>> Etot = -239957.1300 EKtot = 104187.3359 EPtot = -344144.4659
>>>> BOND = 7635.7078 ANGLE = 26341.4448 DIHED = 23765.9225
>>>> UB = 10045.0177 IMP = 325.3101 CMAP = -176.9250
>>>> 1-4 NB = 3598.1690 1-4 EEL = -35454.2960 VDWAALS = 13061.4299
>>>> EELEC = -393286.2467 EHBOND = 0.0000 RESTRAINT = 0.0000
>>>> EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 1544909.6127
>>>> SURFTEN = 0.0000
>>>> Density = 1.0107
>>>> ------------------------------------------------------------------------------
>>>>
>>>>
>>>> NSTEP = 31000000 TIME(PS) = 2205999.989 TEMP(K) = 311.05 PRESS = 0.0
>>>> Etot = -240282.7050 EKtot = 104614.0859 EPtot = -344896.7909
>>>> BOND = 7570.3922 ANGLE = 26289.0463 DIHED = 23693.8946
>>>> UB = 9872.5247 IMP = 331.2119 CMAP = -183.0812
>>>> 1-4 NB = 3607.1803 1-4 EEL = -35542.7377 VDWAALS = 13268.2434
>>>> EELEC = -393803.4653 EHBOND = 0.0000 RESTRAINT = 0.0000
>>>> EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 1546525.2111
>>>> SURFTEN = 0.0000
>>>> Density = 1.0097
>>>> ------------------------------------------------------------------------------
>>>>
>>>>
>>>> NSTEP = 31250000 TIME(PS) = 2206999.989 TEMP(K) = NaN PRESS = 0.0
>>>> Etot = NaN EKtot = NaN EPtot = -344391.5063
>>>> BOND = 7571.4110 ANGLE = 26315.8830 DIHED = 23781.7767
>>>> UB = 9897.5359 IMP = 329.2852 CMAP = -170.7033
>>>> 1-4 NB = 3577.8462 1-4 EEL = -35569.5495 VDWAALS = 12988.7025
>>>> EELEC = -393113.6939 EHBOND = 0.0000 RESTRAINT = 0.0000
>>>> EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 1545119.9333
>>>> SURFTEN = 0.0000
>>>> Density = 1.0106
>>>> ------------------------------------------------------------------------------
>>>>
>>>>
>>>> NSTEP = 31500000 TIME(PS) = 2207999.989 TEMP(K) = NaN PRESS = 0.0
>>>> Etot = NaN EKtot = NaN EPtot = -344914.4568
>>>> BOND = 7521.9425 ANGLE = 26355.6523 DIHED = 23813.2940
>>>> UB = 10004.2951 IMP = 324.9560 CMAP = -194.2010
>>>> 1-4 NB = 3606.3528 1-4 EEL = -35201.7173 VDWAALS = 12831.8700
>>>> EELEC = -393976.9012 EHBOND = 0.0000 RESTRAINT = 0.0000
>>>> EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 1541838.9378
>>>> SURFTEN = 0.0000
>>>> Density = 1.0127
>>>>
>>>>
>>>>
>>>> #########################
>>>>
>>>> My run parameters are:
>>>>
>>>> A NPT simulation for common production-level simulations -- params
>>>> generally from Charmm-gui + some modifications by CN
>>>> &cntrl
>>>> imin=0, ! No minimization
>>>> irest=1, ! irest=1 for restart and irest=0 for new start
>>>> ntx=5, ! ntx=5 to use velocities from inpcrd and ntx=1 to not use them
>>>> ntb=2, ! constant pressure simulation
>>>>
>>>> ! Temperature control
>>>> ntt=3, ! Langevin dynamics
>>>> gamma_ln=1.0, ! Friction coefficient (ps^-1)
>>>> temp0=310.0, ! Target temperature
>>>> tempi=310.0, ! Initial temperature -- has no effect if ntx>3
>>>>
>>>> ! Potential energy control
>>>> cut=12.0, ! nonbonded cutoff, in Angstroms
>>>> fswitch=10.0, ! for charmm.... note charmm-gui suggested cut=0.8 and
>>>> ! no use of fswitch
>>>>
>>>> ! MD settings
>>>> nstlim=250000000, ! 0.25B steps, 1 us total
>>>> dt=0.004, ! time step (ps)
>>>>
>>>> ! SHAKE
>>>> ntc=2, ! Constrain bonds containing hydrogen
>>>> ntf=2, ! Do not calculate forces of bonds containing hydrogen
>>>>
>>>> ! Control how often information is printed
>>>> ntpr=250000, ! Print energy frequency
>>>> ntwx=250000, ! Print coordinate frequency
>>>> ntwr=500000, ! Print restart file frequency
>>>> ! ntwv=-1, ! Uncomment to also print velocities to trajectory
>>>> ! ntwf=-1, ! Uncomment to also print forces to trajectory
>>>> ntxo=2, ! Write NetCDF format
>>>> ioutfm=1, ! Write NetCDF format (always do this!)
>>>>
>>>> ! Wrap coordinates when printing them to the same unit cell
>>>> iwrap=1,
>>>>
>>>> ! Constant pressure control. Note that ntp=3 requires barostat=1
>>>> barostat=2, ! 2 = Monte Carlo barostat (1 = Berendsen)
>>>> ntp=3, ! 1=isotropic, 2=anisotropic, 3=semi-isotropic w/ surften
>>>> pres0=1.01325, ! Target external pressure, in bar
>>>> taup=4, ! Berendsen coupling constant (ps)
>>>> comp=45, ! compressibility
>>>>
>>>> ! Constant surface tension (needed for semi-isotropic scaling). Uncomment
>>>> ! for this feature. csurften must be nonzero if ntp=3 above
>>>> csurften=3, ! Interfaces in 1=yz plane, 2=xz plane, 3=xy plane
>>>> gamma_ten=0.0, ! Surface tension (dyne/cm). 0 gives pure semi-iso scaling
>>>> ninterface=2, ! Number of interfaces (2 for bilayer)
>>>>
>>>> ! Set water atom/residue names for SETTLE recognition
>>>> watnam='SOL', ! Water residues are named SOL
>>>> owtnm='OW', ! Water oxygens are named OW
>>>> hwtnm1='HW1',
>>>> hwtnm2='HW2',
>>>> &end
>>>> &ewald
>>>> vdwmeth = 0,
>>>> &end
>>>>
>>>>
>>>> ##################
>>>>
>>>> and I run like this:
>>>>
>>>> export CUDA_VISIBLE_DEVICES=0,1
>>>> {
>>>> echo "rank 0=localhost slot=0:0"
>>>> echo "rank 1=localhost slot=0:1"
>>>> } > my.rankfile.A
>>>> mpirun --report-bindings --rankfile my.rankfile.A -np 2 \
>>>>   ${AMBERHOME}/bin/pmemd.cuda.MPI -i $amdp -o ${athis}.out \
>>>>   -p this.prmtop -c ${aprev}.rst -r ${athis}.rst -x ${athis}.mdcrd \
>>>>   -inf ${athis}.info -l ${athis}.log
>>>>
>>>> ##################
>>>>
>>>> Thank you,
>>>> Chris.
>>>> _______________________________________________
>>>> AMBER mailing list
>>>> AMBER.ambermd.org
>>>> http://lists.ambermd.org/mailman/listinfo/amber


Received on Sat Sep 16 2017 - 07:30:02 PDT