Re: [AMBER] Etot and TEMP become NaN during a simulation on commodity GPUs

From: Chris Neale <candrewn.gmail.com>
Date: Fri, 15 Sep 2017 12:09:34 -0600

Dear Ross:

I guess it was a GPU glitch (or something similar, see below). Is there any
evidence that these kinds of errors will always lead to NaNs rather than
some other perturbation that is less obvious?
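
One rough way to look for less obvious perturbations, and not only NaNs, is to
scan the printed energies for non-finite values and for unusually large jumps
between consecutive records. The sketch below is only an illustration: the
function names and the max_jump threshold are hypothetical and would need tuning
for a given system. The numbers are the Etot and EPtot values from the failed
run's output.

```python
import math

def first_nonfinite(values):
    """Return the index of the first NaN or infinite value, or None."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            return i
    return None

def flag_jumps(values, max_jump):
    """Return indices whose value differs from the previous finite value
    by more than max_jump (a hypothetical, system-dependent threshold)."""
    flagged = []
    prev = None
    for i, v in enumerate(values):
        if not math.isfinite(v):
            continue  # non-finite values are reported by first_nonfinite
        if prev is not None and abs(v - prev) > max_jump:
            flagged.append(i)
        prev = v
    return flagged

# Etot and EPtot at steps 30750000, 31000000, 31250000 and 31500000
# of the run that developed NaNs.
etot = [-239957.1300, -240282.7050, float("nan"), float("nan")]
eptot = [-344144.4659, -344896.7909, -344391.5063, -344914.4568]

print("first non-finite Etot record:", first_nonfinite(etot))
print("EPtot jumps above 1000 kcal/mol:", flag_jumps(eptot, 1000.0))
```

A check like this only catches perturbations large enough to show up in the
printed energies, so it cannot rule out subtler corruption.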

I re-ran the simulation as suggested and it did not develop NaN values at
the same timestep as the previous run, though it did have the same dynamics
up to that point (as observed from the energies at, e.g., timestep 31000000;
see below).

###############################################################
########### The previous run that had NaN's:


 NSTEP =  31000000   TIME(PS) = 2205999.989  TEMP(K) =   311.05  PRESS =     0.0
 Etot   =   -240282.7050  EKtot   =    104614.0859  EPtot      =   -344896.7909
 BOND   =      7570.3922  ANGLE   =     26289.0463  DIHED      =     23693.8946
 UB     =      9872.5247  IMP     =       331.2119  CMAP       =      -183.0812
 1-4 NB =      3607.1803  1-4 EEL =    -35542.7377  VDWAALS    =     13268.2434
 EELEC  =   -393803.4653  EHBOND  =         0.0000  RESTRAINT  =         0.0000
 EKCMT  =         0.0000  VIRIAL  =         0.0000  VOLUME     =   1546525.2111
                                                    SURFTEN    =         0.0000
                                                    Density    =         1.0097
 ------------------------------------------------------------------------------


 NSTEP =  31250000   TIME(PS) = 2206999.989  TEMP(K) =       NaN  PRESS =     0.0
 Etot   =            NaN  EKtot   =            NaN  EPtot      =   -344391.5063
 BOND   =      7571.4110  ANGLE   =     26315.8830  DIHED      =     23781.7767
 UB     =      9897.5359  IMP     =       329.2852  CMAP       =      -170.7033
 1-4 NB =      3577.8462  1-4 EEL =    -35569.5495  VDWAALS    =     12988.7025
 EELEC  =   -393113.6939  EHBOND  =         0.0000  RESTRAINT  =         0.0000
 EKCMT  =         0.0000  VIRIAL  =         0.0000  VOLUME     =   1545119.9333
                                                    SURFTEN    =         0.0000
                                                    Density    =         1.0106
 ------------------------------------------------------------------------------


###############################################################
########### Repeat run:


 NSTEP =  31000000   TIME(PS) = 2205999.989  TEMP(K) =   311.05  PRESS =     0.0
 Etot   =   -240282.7050  EKtot   =    104614.0859  EPtot      =   -344896.7909
 BOND   =      7570.3922  ANGLE   =     26289.0463  DIHED      =     23693.8946
 UB     =      9872.5247  IMP     =       331.2119  CMAP       =      -183.0812
 1-4 NB =      3607.1803  1-4 EEL =    -35542.7377  VDWAALS    =     13268.2434
 EELEC  =   -393803.4653  EHBOND  =         0.0000  RESTRAINT  =         0.0000
 EKCMT  =         0.0000  VIRIAL  =         0.0000  VOLUME     =   1546525.2111
                                                    SURFTEN    =         0.0000
                                                    Density    =         1.0097
 ------------------------------------------------------------------------------


 NSTEP =  31250000   TIME(PS) = 2206999.989  TEMP(K) =   310.08  PRESS =     0.0
 Etot   =   -239797.5416  EKtot   =    104288.2344  EPtot      =   -344085.7760
 BOND   =      7579.4264  ANGLE   =     26596.7649  DIHED      =     23685.6631
 UB     =     10046.4229  IMP     =       337.2731  CMAP       =      -157.9530
 1-4 NB =      3579.5379  1-4 EEL =    -35600.4566  VDWAALS    =     13288.6101
 EELEC  =   -393441.0649  EHBOND  =         0.0000  RESTRAINT  =         0.0000
 EKCMT  =         0.0000  VIRIAL  =         0.0000  VOLUME     =   1544960.3824
                                                    SURFTEN    =         0.0000
                                                    Density    =         1.0107
 ------------------------------------------------------------------------------
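
For anyone who wants to automate the comparison above, here is a minimal sketch
in Python that parses "NAME = value" pairs from mdout-style text and reports the
first step at which two runs disagree. The function names are hypothetical, the
parser is a rough regular expression and not an official Amber tool, and the
sample text is abbreviated from the two excerpts above.

```python
import math
import re

# "NAME = value" pairs; names may contain digits, hyphens, spaces or parentheses
# (for example "1-4 NB" or "TIME(PS)"), and values may be numbers or NaN.
FIELD = re.compile(
    r"([A-Za-z0-9][A-Za-z0-9()\- ]*?)\s*=\s*([-+]?(?:nan|\d+(?:\.\d*)?))",
    re.IGNORECASE,
)

def parse_records(text):
    """Split mdout-style text into one {field: value} dict per NSTEP record."""
    records = []
    for name, value in FIELD.findall(text):
        name = name.strip()
        if name == "NSTEP":
            records.append({})
        if records:
            records[-1][name] = float(value)
    return records

def first_difference(run_a, run_b, fields=("Etot", "EKtot", "EPtot")):
    """Return (nstep, field) for the first shared step where the printed
    values differ, or None if the runs agree on every shared step."""
    b_by_step = {int(rec["NSTEP"]): rec for rec in run_b}
    for rec in run_a:
        step = int(rec["NSTEP"])
        other = b_by_step.get(step)
        if other is None:
            continue
        for field in fields:
            a, b = rec[field], other[field]
            both_nan = math.isnan(a) and math.isnan(b)
            if not both_nan and a != b:
                return step, field
    return None

previous_run = """
 NSTEP = 31000000 TIME(PS) = 2205999.989 TEMP(K) = 311.05 PRESS = 0.0
 Etot = -240282.7050 EKtot = 104614.0859 EPtot = -344896.7909
 NSTEP = 31250000 TIME(PS) = 2206999.989 TEMP(K) = NaN PRESS = 0.0
 Etot = NaN EKtot = NaN EPtot = -344391.5063
"""
repeat_run = """
 NSTEP = 31000000 TIME(PS) = 2205999.989 TEMP(K) = 311.05 PRESS = 0.0
 Etot = -240282.7050 EKtot = 104614.0859 EPtot = -344896.7909
 NSTEP = 31250000 TIME(PS) = 2206999.989 TEMP(K) = 310.08 PRESS = 0.0
 Etot = -239797.5416 EKtot = 104288.2344 EPtot = -344085.7760
"""

print(first_difference(parse_records(previous_run), parse_records(repeat_run)))
```

With ntpr=250000 this only narrows the divergence down to a 250000-step window;
a smaller ntpr in the rerun would be needed to localize it further.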



Thank you,
Chris.

On Tue, Sep 12, 2017 at 11:16 AM, Chris Neale <candrewn.gmail.com> wrote:

> Dear Ross:
>
> Thanks for the tips. The system has 158,000 atoms. I will run it from the
> previous segment with the same random seed and with a different random seed
> and let you know what the results are. The note below (from my previous run
> that gave NaN's) has me concerned as to whether the "disabling the
> synchronization of random numbers" will affect the reproducibility, but
> perhaps this is not an issue:
>
> Note: ig = -1. Setting random seed to 875535 based on wallclock time in
> microseconds and disabling the synchronization of random numbers
> between tasks to improve performance.
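
To rerun with the same seed, the value reported in this note can be set
explicitly as ig in the &cntrl namelist in place of -1. A small sketch that
pulls the seed out of the mdout text is shown here; the function name is
hypothetical, and whether an explicit seed gives identical trajectories across
two GPUs is a separate question that this snippet does not settle.

```python
import re

def find_random_seed(mdout_text):
    """Return the seed reported in the ig=-1 note, or None if the note is absent."""
    match = re.search(r"Setting random seed to\s+(\d+)", mdout_text)
    return int(match.group(1)) if match else None

# The note as it appears in the mdout of the run that gave NaNs.
note = (
    "Note: ig = -1. Setting random seed to 875535 based on wallclock time in\n"
    "microseconds and disabling the synchronization of random numbers\n"
    "between tasks to improve performance.\n"
)

seed = find_random_seed(note)
if seed is not None:
    # Line to place in &cntrl for the repeat run.
    print(f"ig={seed},")
```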
>
> I don't mind the occasional glitch on a GPU, but I'm also happy to help
> track things down if that is useful for you.
>
> Thank you,
> Chris.
>
>
> On Tue, Sep 12, 2017 at 10:43 AM, Ross Walker <ross.rosswalker.co.uk>
> wrote:
>
>> Hi Chris,
>>
>> How many atoms are in your simulation? This issue, a water shooting off to
>> infinity, can happen if your box is too small for the GPU code. The way the
>> list is built, if any edge of the box is less than 2 x (cutoff+skinnb) at
>> any point during the simulation, it causes a failure in the list build, and
>> the net result is often that an atom ends up with NaNs as its coordinates;
>> in the case of a water with SHAKE, it drags the other atoms with it. If you
>> want to be sure this is the issue and not an ECC error, go back to the
>> restart file before the NaNs occurred and run again with the same random
>> seed and hardware, and you should get an identical (bad) restart file.
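
The box-size rule quoted above can be written as a quick check. This is only a
sketch: the function names are hypothetical, skinnb=2.0 is assumed to be the
default because the input file does not set it, and the box edges below are
made-up values since the actual edges are not given in this thread.

```python
def min_safe_edge(cut, skinnb):
    """Smallest box edge (Angstroms) allowed by the 2 x (cutoff + skinnb) rule."""
    return 2.0 * (cut + skinnb)

def box_is_large_enough(edges, cut, skinnb):
    """True if every box edge is at least 2 x (cut + skinnb)."""
    limit = min_safe_edge(cut, skinnb)
    return all(edge >= limit for edge in edges)

cut = 12.0     # from the &cntrl namelist in this thread
skinnb = 2.0   # assumption: default value, not set in the input file

# Hypothetical box edges in Angstroms, for illustration only.
for edges in [(110.0, 110.0, 128.0), (27.0, 110.0, 128.0)]:
    print(edges, "limit", min_safe_edge(cut, skinnb),
          "ok" if box_is_large_enough(edges, cut, skinnb) else "too small")
```

Because the box fluctuates under NPT, a check like this would have to be applied
to the box dimensions of every saved frame, not just the starting structure.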
>>
>> Unfortunately this is expensive to check for on every step, and GPUs do
>> not handle NaNs in the same way CPUs do, so it is only checked for when
>> the restart file is loaded to continue a run. At that point all you can do
>> is discard the previous run and, if you are very close to the limit, run
>> again with a different random seed and you might be lucky, or switch to
>> NVT, or reduce cut or skinnb, or add more water to make your system bigger.
>>
>> That said, the pdb output suggests at least 54K atoms, so this is a big
>> system. So it could just be a random glitch in the GPU - not sure ECC would
>> catch it since it doesn't protect the whole computation path. Try repeating
>> the simulation identically and see if the error occurs in the same place.
>> If it does, then this may be a bug in the code we haven't come across before.
>>
>> All the best
>> Ross
>>
>> > On Sep 12, 2017, at 12:28 PM, Chris Neale <candrewn.gmail.com> wrote:
>> >
>> > Dear users:
>> >
>> > Data corruption seems final, so I am backing up to a previous simulation
>> > segment, but I thought I'd report this in case it is useful to anybody.
>> >
>> > I am running Amber 16 pmemd on 2 GPUs, using a CHARMM force field with a
>> > topology built in GROMACS and then ported to Amber with ParmEd. I've run
>> > hundreds of microseconds without seeing this type of issue, so my guess is
>> > that it is not specific to the system or the force field. At the moment,
>> > I'm suspecting that it is one of those rare things that would have been
>> > caught by ECC if I was using a GPU that supported it (which I am not).
>> >
>> > During an attempt to restart my simulation, pmemd gives the error:
>> >
>> > | ERROR: NaN(s) found in input coordinates.
>> > This likely means that something went wrong in the previous
>> > simulation.
>> >
>> >
>> > And the command:
>> > ambpdb -p this.prmtop -c v0.5_5.rst
>> >
>> > produces output that contains this obvious problem:
>> >
>> > ATOM  54207  HW1 SOL  3056       5.239  35.425 112.718  1.00  0.00           H
>> > ATOM  54208  HW2 SOL  3056       5.923  36.752 112.464  1.00  0.00           H
>> > ATOM  54209  OW  SOL  3057      78.934  30.200  23.018  1.00  0.00           O
>> > ATOM  54210  HW1 SOL  3057      78.261  30.829  23.279  1.00  0.00           H
>> > ATOM  54211  HW2 SOL  3057      79.017  30.317  22.071  1.00  0.00           H
>> > ATOM  54212  OW  SOL  3058        -nan    -nan    -nan  1.00  0.00           O
>> > ATOM  54213  HW1 SOL  3058        -nan    -nan    -nan  1.00  0.00           H
>> > ATOM  54214  HW2 SOL  3058        -nan    -nan    -nan  1.00  0.00           H
>> > ATOM  54215  OW  SOL  3059      49.273 109.879  40.039  1.00  0.00           O
>> > ATOM  54216  HW1 SOL  3059      48.566 109.877  39.394  1.00  0.00           H
>> > ATOM  54217  HW2 SOL  3059      50.056 109.653  39.537  1.00  0.00           H
>> > ATOM  54218  OW  SOL  3060      52.061  48.796  41.712  1.00  0.00           O
>> > ATOM  54219  HW1 SOL  3060      51.608  49.476  41.214  1.00  0.00           H
>> > ATOM  54220  HW2 SOL  3060      52.978  49.072  41.715  1.00  0.00           H
>> >
>> > Looking back at the previous segment of simulation, I can see where the
>> > Etot term popped from a real number to NaN:
>> >
>> > NSTEP =  30750000   TIME(PS) = 2204999.989  TEMP(K) =   309.78  PRESS =     0.0
>> > Etot   =   -239957.1300  EKtot   =    104187.3359  EPtot      =   -344144.4659
>> > BOND   =      7635.7078  ANGLE   =     26341.4448  DIHED      =     23765.9225
>> > UB     =     10045.0177  IMP     =       325.3101  CMAP       =      -176.9250
>> > 1-4 NB =      3598.1690  1-4 EEL =    -35454.2960  VDWAALS    =     13061.4299
>> > EELEC  =   -393286.2467  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>> > EKCMT  =         0.0000  VIRIAL  =         0.0000  VOLUME     =   1544909.6127
>> >                                                    SURFTEN    =         0.0000
>> >                                                    Density    =         1.0107
>> > ------------------------------------------------------------------------------
>> >
>> >
>> > NSTEP =  31000000   TIME(PS) = 2205999.989  TEMP(K) =   311.05  PRESS =     0.0
>> > Etot   =   -240282.7050  EKtot   =    104614.0859  EPtot      =   -344896.7909
>> > BOND   =      7570.3922  ANGLE   =     26289.0463  DIHED      =     23693.8946
>> > UB     =      9872.5247  IMP     =       331.2119  CMAP       =      -183.0812
>> > 1-4 NB =      3607.1803  1-4 EEL =    -35542.7377  VDWAALS    =     13268.2434
>> > EELEC  =   -393803.4653  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>> > EKCMT  =         0.0000  VIRIAL  =         0.0000  VOLUME     =   1546525.2111
>> >                                                    SURFTEN    =         0.0000
>> >                                                    Density    =         1.0097
>> > ------------------------------------------------------------------------------
>> >
>> >
>> > NSTEP =  31250000   TIME(PS) = 2206999.989  TEMP(K) =       NaN  PRESS =     0.0
>> > Etot   =            NaN  EKtot   =            NaN  EPtot      =   -344391.5063
>> > BOND   =      7571.4110  ANGLE   =     26315.8830  DIHED      =     23781.7767
>> > UB     =      9897.5359  IMP     =       329.2852  CMAP       =      -170.7033
>> > 1-4 NB =      3577.8462  1-4 EEL =    -35569.5495  VDWAALS    =     12988.7025
>> > EELEC  =   -393113.6939  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>> > EKCMT  =         0.0000  VIRIAL  =         0.0000  VOLUME     =   1545119.9333
>> >                                                    SURFTEN    =         0.0000
>> >                                                    Density    =         1.0106
>> > ------------------------------------------------------------------------------
>> >
>> >
>> > NSTEP =  31500000   TIME(PS) = 2207999.989  TEMP(K) =       NaN  PRESS =     0.0
>> > Etot   =            NaN  EKtot   =            NaN  EPtot      =   -344914.4568
>> > BOND   =      7521.9425  ANGLE   =     26355.6523  DIHED      =     23813.2940
>> > UB     =     10004.2951  IMP     =       324.9560  CMAP       =      -194.2010
>> > 1-4 NB =      3606.3528  1-4 EEL =    -35201.7173  VDWAALS    =     12831.8700
>> > EELEC  =   -393976.9012  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>> > EKCMT  =         0.0000  VIRIAL  =         0.0000  VOLUME     =   1541838.9378
>> >                                                    SURFTEN    =         0.0000
>> >                                                    Density    =         1.0127
>> >
>> >
>> >
>> > #########################
>> >
>> > My run parameters are:
>> >
>> > A NPT simulation for common production-level simulations -- params
>> > generally from Charmm-gui + some modifications by CN
>> > &cntrl
>> > imin=0, ! No minimization
>> > irest=1, ! irest=1 for restart and irest=0 for new start
>> > ntx=5, ! ntx=5 to use velocities from inpcrd and ntx=1 to not use them
>> > ntb=2, ! constant pressure simulation
>> >
>> > ! Temperature control
>> > ntt=3, ! Langevin dynamics
>> > gamma_ln=1.0, ! Friction coefficient (ps^-1)
>> > temp0=310.0, ! Target temperature
>> > tempi=310.0, ! Initial temperature -- has no effect if ntx>3
>> >
>> > ! Potential energy control
>> > cut=12.0, ! nonbonded cutoff, in Angstroms
>> > fswitch=10.0, ! for charmm.... note charmm-gui suggested cut=0.8 and no use of fswitch
>> >
>> > ! MD settings
>> > nstlim=250000000, ! 0.25B steps, 1 us total
>> > dt=0.004, ! time step (ps)
>> >
>> > ! SHAKE
>> > ntc=2, ! Constrain bonds containing hydrogen
>> > ntf=2, ! Do not calculate forces of bonds containing hydrogen
>> >
>> > ! Control how often information is printed
>> > ntpr=250000, ! Print energy frequency
>> > ntwx=250000, ! Print coordinate frequency
>> > ntwr=500000, ! Print restart file frequency
>> > ! ntwv=-1, ! Uncomment to also print velocities to trajectory
>> > ! ntwf=-1, ! Uncomment to also print forces to trajectory
>> > ntxo=2, ! Write restart file in NetCDF format
>> > ioutfm=1, ! Write trajectory in NetCDF format (always do this!)
>> >
>> > ! Wrap coordinates when printing them to the same unit cell
>> > iwrap=1,
>> >
>> > ! Constant pressure control. Note that ntp=3 requires barostat=1
>> > barostat=2, ! MC barostat (1=Berendsen, 2=Monte Carlo)
>> > ntp=3, ! 1=isotropic, 2=anisotropic, 3=semi-isotropic w/ surften
>> > pres0=1.01325, ! Target external pressure, in bar
>> > taup=4, ! Berendsen coupling constant (ps)
>> > comp=45, ! compressibility
>> >
>> > ! Constant surface tension (needed for semi-isotropic scaling). Uncomment
>> > ! for this feature. csurften must be nonzero if ntp=3 above
>> > csurften=3, ! Interfaces in 1=yz plane, 2=xz plane, 3=xy plane
>> > gamma_ten=0.0, ! Surface tension (dyne/cm). 0 gives pure semi-iso scaling
>> > ninterface=2, ! Number of interfaces (2 for bilayer)
>> >
>> > ! Set water atom/residue names for SETTLE recognition
>> > watnam='SOL', ! Water residues are named SOL
>> > owtnm='OW', ! Water oxygens are named OW
>> > hwtnm1='HW1',
>> > hwtnm2='HW2',
>> > &end
>> > &ewald
>> > vdwmeth = 0,
>> > &end
>> >
>> >
>> > ##################
>> >
>> > and I run like this:
>> >
>> > export CUDA_VISIBLE_DEVICES=0,1
>> > {
>> > echo "rank 0=localhost slot=0:0"
>> > echo "rank 1=localhost slot=0:1"
>> > } > my.rankfile.A
>> > mpirun --report-bindings --rankfile my.rankfile.A -np 2 \
>> >   ${AMBERHOME}/bin/pmemd.cuda.MPI -i $amdp -o ${athis}.out -p this.prmtop \
>> >   -c ${aprev}.rst -r ${athis}.rst -x ${athis}.mdcrd -inf ${athis}.info \
>> >   -l ${athis}.log
>> >
>> > ##################
>> >
>> > Thank you,
>> > Chris.
>> > _______________________________________________
>> > AMBER mailing list
>> > AMBER.ambermd.org
>> > http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Sep 15 2017 - 11:30:02 PDT