Re: [AMBER] Etot and TEMP become NaN during a simulation on commodity GPUs

From: Chris Neale <candrewn.gmail.com>
Date: Tue, 12 Sep 2017 11:16:47 -0600

Dear Ross:

Thanks for the tips. The system has 158,000 atoms. I will run it from the
previous segment with the same random seed and with a different random seed
and let you know what the results are. The note below (from my previous run
that gave NaN's) has me concerned as to whether the "disabling the
synchronization of random numbers" will affect the reproducibility, but
perhaps this is not an issue:

Note: ig = -1. Setting random seed to 875535 based on wallclock time in
      microseconds and disabling the synchronization of random numbers
      between tasks to improve performance.

I don't mind the occasional glitch on a GPU, but I'm also happy to help
track things down if that is useful for you.

Thank you,
Chris.


On Tue, Sep 12, 2017 at 10:43 AM, Ross Walker <ross.rosswalker.co.uk> wrote:

> Hi Chris,
>
> How many atoms are in your simulation? This issue, where a water shoots
> off to infinity, can happen if your box is too small for the GPU code.
> Because of the way the nonbonded list is built, if any edge of the box is
> less than 2 x (cutoff + skinnb) at any point during the simulation, the
> list build fails, and the net result is often an atom that ends up with
> NaNs as its coordinates. In the case of a water with SHAKE, it drags the
> other atoms of the molecule with it. If you want to be sure this is the
> issue and not an ECC error, go back to the restart file before the NaNs
> occurred and run again with the same random seed and hardware; you should
> get an identical (bad) restart file.
>
> Unfortunately this is expensive to check for on every step, and GPUs do
> not handle NaNs in the same way CPUs do, so it is only checked for when
> the restart file is loaded to continue a run. At that point all you can do
> is discard the previous run and, if you are very close to the limit, run
> again with a different random seed and hope to be lucky, or switch to NVT,
> or reduce cut or skinnb, or add more water to make your system bigger.
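
[Editorial sketch] The box-size rule described above can be checked by hand
before a run. The snippet below is only an illustration under stated
assumptions: cut=12.0 is taken from the input file quoted later in this
thread, skinnb=2.0 is assumed to be the value in effect (it is not set in
that input), and the box edge lengths are hypothetical placeholders, not
values from this system. The helper names are made up for this example and
the check only considers orthogonal box edges.

```python
def min_safe_edge(cut, skinnb):
    """Smallest box edge allowed by the rule: 2 x (cutoff + skinnb)."""
    return 2.0 * (cut + skinnb)


def short_edges(box, cut, skinnb):
    """Return (axis, length) for every box edge below the limit."""
    limit = min_safe_edge(cut, skinnb)
    return [(axis, length) for axis, length in zip("xyz", box) if length < limit]


if __name__ == "__main__":
    cut = 12.0      # from the quoted mdin
    skinnb = 2.0    # ASSUMPTION: not set in the quoted mdin
    box = (110.0, 110.0, 127.0)  # HYPOTHETICAL edge lengths in Angstroms
    print("minimum safe edge:", min_safe_edge(cut, skinnb))
    print("edges below the limit:", short_edges(box, cut, skinnb))
```

Under NPT the box fluctuates, so the check is only meaningful if it is
applied to the smallest edge seen during the run, not just the starting box.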
>
> That said, the pdb output suggests at least 54K atoms, so this is a big
> system. So it could just be a random glitch in the GPU; I am not sure ECC
> would catch it, since it doesn't protect the whole computation path. Try
> repeating the simulation identically and see if the error occurs in the
> same place. If it does, then this may be a bug in the code that we haven't
> come across before.
>
> All the best
> Ross
>
> > On Sep 12, 2017, at 12:28 PM, Chris Neale <candrewn.gmail.com> wrote:
> >
> > Dear users:
> >
> > Data corruption seems final, so I am backing up to a previous simulation
> > segment, but I thought I'd report this in case it is useful to anybody.
> >
> > I am running Amber 16 pmemd on 2 GPUs, using a charmm force field with a
> > topology built in gromacs and then ported to amber with parmed. I've run
> > hundreds of microseconds without seeing this type of issue, so my guess is
> > that it is not specific to the system or the forcefield. At the moment, I'm
> > suspecting that it is one of those rare things that would have been caught
> > by ECC if I was using a GPU that supported it (which I am not).
> >
> > During an attempt to restart my simulation, pmemd gives the error:
> >
> > | ERROR: NaN(s) found in input coordinates.
> > This likely means that something went wrong in the previous simulation.
> >
> >
> > And the command:
> > ambpdb -p this.prmtop -c v0.5_5.rst
> >
> > produces output that contains this obvious problem:
> >
> > ATOM 54207 HW1 SOL 3056 5.239 35.425 112.718 1.00 0.00 H
> > ATOM 54208 HW2 SOL 3056 5.923 36.752 112.464 1.00 0.00 H
> > ATOM 54209 OW SOL 3057 78.934 30.200 23.018 1.00 0.00 O
> > ATOM 54210 HW1 SOL 3057 78.261 30.829 23.279 1.00 0.00 H
> > ATOM 54211 HW2 SOL 3057 79.017 30.317 22.071 1.00 0.00 H
> > ATOM 54212 OW SOL 3058 -nan -nan -nan 1.00 0.00 O
> > ATOM 54213 HW1 SOL 3058 -nan -nan -nan 1.00 0.00 H
> > ATOM 54214 HW2 SOL 3058 -nan -nan -nan 1.00 0.00 H
> > ATOM 54215 OW SOL 3059 49.273 109.879 40.039 1.00 0.00 O
> > ATOM 54216 HW1 SOL 3059 48.566 109.877 39.394 1.00 0.00 H
> > ATOM 54217 HW2 SOL 3059 50.056 109.653 39.537 1.00 0.00 H
> > ATOM 54218 OW SOL 3060 52.061 48.796 41.712 1.00 0.00 O
> > ATOM 54219 HW1 SOL 3060 51.608 49.476 41.214 1.00 0.00 H
> > ATOM 54220 HW2 SOL 3060 52.978 49.072 41.715 1.00 0.00 H
> >
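
[Editorial sketch] In a large system the bad atoms are easier to find with a
script than by eye. The snippet below is a minimal illustration, not part of
any Amber tool: it assumes whitespace-separated ATOM records laid out as in
the excerpt above (record, serial, atom name, residue name, residue number,
x, y, z, ...) with no chain ID column and with any mail quote prefix already
removed. Strict fixed-column PDB files would need column slicing instead.
The function name is hypothetical.

```python
import math


def nan_atoms(pdb_text):
    """Return (serial, atom, residue, resnum) for atoms with NaN coordinates."""
    bad = []
    for line in pdb_text.splitlines():
        fields = line.split()
        # Skip anything that is not an ATOM/HETATM record with coordinates.
        if len(fields) < 8 or fields[0] not in ("ATOM", "HETATM"):
            continue
        try:
            xyz = [float(v) for v in fields[5:8]]  # float() accepts "-nan"
        except ValueError:
            continue  # layout differs from the assumed one; ignore the line
        if any(math.isnan(v) for v in xyz):
            bad.append((int(fields[1]), fields[2], fields[3], int(fields[4])))
    return bad


if __name__ == "__main__":
    sample = (
        "ATOM 54211 HW2 SOL 3057 79.017 30.317 22.071 1.00 0.00 H\n"
        "ATOM 54212 OW SOL 3058 -nan -nan -nan 1.00 0.00 O\n"
    )
    for atom in nan_atoms(sample):
        print(atom)
```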
> > Looking back at the previous segment of simulation, I can see where the
> > Etot term popped from a real number to NaN:
> >
> > NSTEP = 30750000 TIME(PS) = 2204999.989 TEMP(K) = 309.78 PRESS = 0.0
> > Etot = -239957.1300 EKtot = 104187.3359 EPtot = -344144.4659
> > BOND = 7635.7078 ANGLE = 26341.4448 DIHED = 23765.9225
> > UB = 10045.0177 IMP = 325.3101 CMAP = -176.9250
> > 1-4 NB = 3598.1690 1-4 EEL = -35454.2960 VDWAALS = 13061.4299
> > EELEC = -393286.2467 EHBOND = 0.0000 RESTRAINT = 0.0000
> > EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 1544909.6127
> > SURFTEN = 0.0000
> > Density = 1.0107
> > ------------------------------------------------------------------------------
> >
> >
> > NSTEP = 31000000 TIME(PS) = 2205999.989 TEMP(K) = 311.05 PRESS = 0.0
> > Etot = -240282.7050 EKtot = 104614.0859 EPtot = -344896.7909
> > BOND = 7570.3922 ANGLE = 26289.0463 DIHED = 23693.8946
> > UB = 9872.5247 IMP = 331.2119 CMAP = -183.0812
> > 1-4 NB = 3607.1803 1-4 EEL = -35542.7377 VDWAALS = 13268.2434
> > EELEC = -393803.4653 EHBOND = 0.0000 RESTRAINT = 0.0000
> > EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 1546525.2111
> > SURFTEN = 0.0000
> > Density = 1.0097
> > ------------------------------------------------------------------------------
> >
> >
> > NSTEP = 31250000 TIME(PS) = 2206999.989 TEMP(K) = NaN PRESS = 0.0
> > Etot = NaN EKtot = NaN EPtot = -344391.5063
> > BOND = 7571.4110 ANGLE = 26315.8830 DIHED = 23781.7767
> > UB = 9897.5359 IMP = 329.2852 CMAP = -170.7033
> > 1-4 NB = 3577.8462 1-4 EEL = -35569.5495 VDWAALS = 12988.7025
> > EELEC = -393113.6939 EHBOND = 0.0000 RESTRAINT = 0.0000
> > EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 1545119.9333
> > SURFTEN = 0.0000
> > Density = 1.0106
> > ------------------------------------------------------------------------------
> >
> >
> > NSTEP = 31500000 TIME(PS) = 2207999.989 TEMP(K) = NaN PRESS = 0.0
> > Etot = NaN EKtot = NaN EPtot = -344914.4568
> > BOND = 7521.9425 ANGLE = 26355.6523 DIHED = 23813.2940
> > UB = 10004.2951 IMP = 324.9560 CMAP = -194.2010
> > 1-4 NB = 3606.3528 1-4 EEL = -35201.7173 VDWAALS = 12831.8700
> > EELEC = -393976.9012 EHBOND = 0.0000 RESTRAINT = 0.0000
> > EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 1541838.9378
> > SURFTEN = 0.0000
> > Density = 1.0127
> >
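
[Editorial sketch] Locating the first bad energy record, as done by eye
above, can be scripted. This is a minimal illustration that assumes the
"NAME = value" layout of the mdout records shown in this message; it is not
an Amber utility and the function name is made up. It reports the first
printed step at which TEMP(K), Etot, EKtot or EPtot reads NaN. With
ntpr=250000 that only brackets the failure to somewhere after the previous
printed record.

```python
import re

NSTEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")
FIELD_RE = re.compile(r"(Etot|EKtot|EPtot|TEMP\(K\))\s*=\s*(\S+)")


def first_nan_step(mdout_text):
    """Return (nstep, field) for the first NaN energy term, or None."""
    step = None
    for line in mdout_text.splitlines():
        match = NSTEP_RE.search(line)
        if match:
            step = int(match.group(1))  # remember the current record's step
        for name, value in FIELD_RE.findall(line):
            if step is not None and value.lower().lstrip("+-") == "nan":
                return step, name
    return None


if __name__ == "__main__":
    sample = (
        "NSTEP = 31000000 TIME(PS) = 2205999.989 TEMP(K) = 311.05 PRESS = 0.0\n"
        "Etot = -240282.7050 EKtot = 104614.0859 EPtot = -344896.7909\n"
        "NSTEP = 31250000 TIME(PS) = 2206999.989 TEMP(K) = NaN PRESS = 0.0\n"
        "Etot = NaN EKtot = NaN EPtot = -344391.5063\n"
    )
    print(first_nan_step(sample))
```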
> >
> >
> > #########################
> >
> > My run parameters are:
> >
> > A NPT simulation for common production-level simulations -- params
> > generally from Charmm-gui + some modifications by CN
> > &cntrl
> > imin=0, ! No minimization
> > irest=1, ! irest=1 for restart and irest=0 for new start
> > ntx=5, ! ntx=5 to use velocities from inpcrd and ntx=1 to not use them
> > ntb=2, ! constant pressure simulation
> >
> > ! Temperature control
> > ntt=3, ! Langevin dynamics
> > gamma_ln=1.0, ! Friction coefficient (ps^-1)
> > temp0=310.0, ! Target temperature
> > tempi=310.0, ! Initial temperature -- has no effect if ntx>3
> >
> > ! Potential energy control
> > cut=12.0, ! nonbonded cutoff, in Angstroms
> > fswitch=10.0, ! for charmm.... note charmm-gui suggested cut=0.8 and no use of fswitch
> >
> > ! MD settings
> > nstlim=250000000, ! 0.25B steps, 1 us total
> > dt=0.004, ! time step (ps)
> >
> > ! SHAKE
> > ntc=2, ! Constrain bonds containing hydrogen
> > ntf=2, ! Do not calculate forces of bonds containing hydrogen
> >
> > ! Control how often information is printed
> > ntpr=250000, ! Print energy frequency
> > ntwx=250000, ! Print coordinate frequency
> > ntwr=500000, ! Print restart file frequency
> > ! ntwv=-1, ! Uncomment to also print velocities to trajectory
> > ! ntwf=-1, ! Uncomment to also print forces to trajectory
> > ntxo=2, ! Write NetCDF format
> > ioutfm=1, ! Write NetCDF format (always do this!)
> >
> > ! Wrap coordinates when printing them to the same unit cell
> > iwrap=1,
> >
> > ! Constant pressure control. Note that ntp=3 requires barostat=1
> > barostat=2, ! MC barostat (1 = Berendsen)
> > ntp=3, ! 1=isotropic, 2=anisotropic, 3=semi-isotropic w/ surften
> > pres0=1.01325, ! Target external pressure, in bar
> > taup=4, ! Berendsen coupling constant (ps)
> > comp=45, ! compressibility
> >
> > ! Constant surface tension (needed for semi-isotropic scaling). Uncomment
> > ! for this feature. csurften must be nonzero if ntp=3 above
> > csurften=3, ! Interfaces in 1=yz plane, 2=xz plane, 3=xy plane
> > gamma_ten=0.0, ! Surface tension (dyne/cm). 0 gives pure semi-iso scaling
> > ninterface=2, ! Number of interfaces (2 for bilayer)
> >
> > ! Set water atom/residue names for SETTLE recognition
> > watnam='SOL', ! Water residues are named SOL
> > owtnm='OW', ! Water oxygens are named OW
> > hwtnm1='HW1',
> > hwtnm2='HW2',
> > &end
> > &ewald
> > vdwmeth = 0,
> > &end
> >
> >
> > ##################
> >
> > and I run like this:
> >
> > export CUDA_VISIBLE_DEVICES=0,1
> > {
> > echo "rank 0=localhost slot=0:0"
> > echo "rank 1=localhost slot=0:1"
> > } > my.rankfile.A
> > mpirun --report-bindings --rankfile my.rankfile.A -np 2 \
> >   ${AMBERHOME}/bin/pmemd.cuda.MPI -i $amdp -o ${athis}.out -p this.prmtop \
> >   -c ${aprev}.rst -r ${athis}.rst -x ${athis}.mdcrd -inf ${athis}.info \
> >   -l ${athis}.log
> >
> > ##################
> >
> > Thank you,
> > Chris.
> > _______________________________________________
> > AMBER mailing list
> > AMBER.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Sep 12 2017 - 10:30:02 PDT