Hi Heather,
If you are running this on 2 GPUs at once (i.e. mpirun -np 2), then this is
almost certainly due to the race condition introduced with the SPFP patch
in bugfix.9. Bugfix.12 will fix this.
If this happens in serial on a single GPU then it will need more
investigation. Does a similar thing happen if you run on CPUs? Does your
system have phosphates present by any chance? Collapse of hydroxyl protons,
which have no VDW radii, onto highly charged atoms (often phosphate groups)
can cause NaNs in simulations, and they tend to appear randomly. The fix is
to add a very small VDW radius to the hydroxyl protons in the force field.
This is speculation on what is happening though. Ideally you'll need to run
with ntpr=1, ntwx=1 so you can watch what actually happens immediately
before the NaN occurs.
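With ntpr=1 the output file gets long, so a small script can help locate the
first bad step. The following is only a minimal sketch: it assumes each energy
record in the mdout file starts with a line containing "NSTEP = <n>" and that
non-finite values are printed as NaN or Infinity, as in the output quoted
below. The function name first_bad_step is a made-up helper for illustration,
not part of Amber.

```python
import re
import sys

# Start of an energy record, e.g. "NSTEP =     2328   TIME(PS) = ..."
NSTEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")
# A field whose value is not a finite number, e.g. "TEMP(K) = Infinity"
BAD_RE = re.compile(r"=\s*(NaN|[-+]?Infinity)", re.IGNORECASE)


def first_bad_step(lines):
    """Return (nstep, line) for the first record line holding NaN/Infinity.

    Returns None if every value seen is finite.
    """
    step = None
    for line in lines:
        match = NSTEP_RE.search(line)
        if match:
            step = int(match.group(1))
        if step is not None and BAD_RE.search(line):
            return step, line.strip()
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (file name is an example): python find_nan.py heat3.out
    with open(sys.argv[1]) as handle:
        print(first_bad_step(handle))
```

Once you know the step, look at the frames written just before it in the
trajectory (ntwx=1) to see which atoms are moving.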
All the best
Ross
On 10/19/12 7:36 AM, "mhclewett.msn.com" <mhclewett.msn.com> wrote:
>
>Hello and thank you in advance for your help,
>I have a NaN error that does not seem to respond to the posted
>fixes/suggestions. I will provide as much information as would be
>helpful. My guess of what is helpful follows.
>I am operating Amber12 on a 2 GPU system that follows the Ross Walker
>recommendation for all hardware. The command nvidia-smi returns a
>temperature of 80 C for GPU0 and 85 C for GPU1.
>My bugfixes are current through bugfix.24.
>I am modeling a system after the TrpCage tutorial and have used the
>TrpCage tutorial as a starting point for input files.
>The error shown is about 30% of the way through heat3.out:
> NSTEP =     2450   TIME(PS) =      11.225  TEMP(K) =   152.87  PRESS =     0.0
> Etot   =     -4490.8226  EKtot   =      2014.7379  EPtot      =     -6505.5606
> BOND   =       599.4379  ANGLE   =      1830.9888  DIHED      =      2996.7946
> 1-4 NB =      1019.0625  1-4 EEL =     21561.0655  VDWAALS    =     -2513.1648
> EELEC  =    -23076.1258  EGB     =     -8923.6194  RESTRAINT  =         0.0000
> ------------------------------------------------------------------------------
>
> NSTEP =     2500   TIME(PS) =      11.250  TEMP(K) =   149.42  PRESS =     0.0
> Etot   =     -4487.8111  EKtot   =      1969.2665  EPtot      =     -6457.0776
> BOND   =       598.1061  ANGLE   =      1847.8162  DIHED      =      2996.0754
> 1-4 NB =      1030.0616  1-4 EEL =     21568.1362  VDWAALS    =     -2496.5528
> EELEC  =    -23090.1034  EGB     =     -8910.6168  RESTRAINT  =         0.0000
> ------------------------------------------------------------------------------
>
> NSTEP =     2550   TIME(PS) =      11.275  TEMP(K) =      NaN  PRESS =     0.0
> Etot   =            NaN  EKtot   =            NaN  EPtot      =    583581.2203
> BOND   =         0.0000  ANGLE   =    643644.5002  DIHED      =         0.0000
> 1-4 NB =         0.0000  1-4 EEL =         0.0000  VDWAALS    =         0.0000
> EELEC  =         0.0000  EGB     =    -60063.2798  RESTRAINT  =         0.0000
> ------------------------------------------------------------------------------
>and heat3.in looks like this:
>Stage 1 heating of AB42 dimer 100 to 150K
> &cntrl
>  imin=0, irest=1, ntx=5, nstlim=10000, dt=0.0005, ntc=2, ntf=2,
>  ntt=3, gamma_ln=5.0, tempi=100.0, temp0=150.0, ntpr=50, ntwx=50,
>  ntb=0, igb=5, ig=-1, cut=999., rgbmax=999.
> /
>
>If I modify heat3.in as follows (all other lines the same):
>ntpr=1, ntwx=1, nscm=100,
>then I get (again, from a new heat3.out):
> NSTEP =     2326   TIME(PS) =      11.163  TEMP(K) =   153.25  PRESS =     0.0
> Etot   =     -4457.1334  EKtot   =      2019.7349  EPtot      =     -6476.8683
> BOND   =       579.0866  ANGLE   =      1857.8567  DIHED      =      2994.7695
> 1-4 NB =      1034.6359  1-4 EEL =     21582.3349  VDWAALS    =     -2535.9145
> EELEC  =    -23099.5555  EGB     =     -8890.0819  RESTRAINT  =         0.0000
> ------------------------------------------------------------------------------
>
> NSTEP =     2327   TIME(PS) =      11.164  TEMP(K) =   153.24  PRESS =     0.0
> Etot   =     -4456.8246  EKtot   =      2019.5847  EPtot      =     -6476.4093
> BOND   =       579.5806  ANGLE   =      1859.4058  DIHED      =      2993.8184
> 1-4 NB =      1034.7352  1-4 EEL =     21581.9544  VDWAALS    =     -2535.9831
> EELEC  =    -23099.1063  EGB     =     -8890.8143  RESTRAINT  =         0.0000
> ------------------------------------------------------------------------------
>
> NSTEP =     2328   TIME(PS) =      11.164  TEMP(K) = Infinity  PRESS =     0.0
> Etot   =       Infinity  EKtot   =       Infinity  EPtot      =     -6474.1288
> BOND   =       580.8156  ANGLE   =      1861.9235  DIHED      =      2993.0305
> 1-4 NB =      1034.8214  1-4 EEL =     21581.6528  VDWAALS    =     -2536.1536
> EELEC  =    -23098.8105  EGB     =     -8891.4086  RESTRAINT  =         0.0000
> ------------------------------------------------------------------------------
>
> NSTEP =     2329   TIME(PS) =      11.165  TEMP(K) =      NaN  PRESS =     0.0
> Etot   =            NaN  EKtot   =            NaN  EPtot      =     -5513.0260
> BOND   =       473.0124  ANGLE   =      2960.9029  DIHED      =      3028.6947
> 1-4 NB =      1034.4870  1-4 EEL =     21500.3431  VDWAALS    =     -2534.0394
> EELEC  =    -22853.5152  EGB     =     -9122.9116  RESTRAINT  =         0.0000
> ------------------------------------------------------------------------------
>Again, I am happy to provide as much additional info as would be helpful
>and am tremendously grateful for your advice and gift of time in
>responding.
>Heather Clewett
>Chemistry Graduate Student
>University of Nevada, Reno
>
>_______________________________________________
>AMBER mailing list
>AMBER.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Oct 19 2012 - 11:30:05 PDT