Re: [AMBER] NaN question

From: Aron Broom <broomsday.gmail.com>
Date: Fri, 19 Oct 2012 10:51:35 -0400

There is a chance this is caused by bad memory on your GPU; the GTX cards
sometimes have this problem. I just posted a reply to someone asking about
GTX 580s. The important point is that the people who make OpenMM (SimTK)
provide a GPU memory checker. I was seeing a problem similar to yours on a
GTX 580, and the memory checker reported a lot of errors.

I guess a major troubleshooting question, in deciding whether this is a GPU
problem or a problem with your system, is: does the error occur at the same
timestep? If you don't set ig=-1, the random seed used by the thermostat
should be the same on every run, so you'd expect to see the error at the
same timestep. (Your heat3.in sets ig=-1, so you would need to switch to a
fixed seed for this test.) If the failure still moves around from run to
run, that strongly points to the GPU.
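
To make that comparison less tedious, here is a minimal Python sketch that
reports the first NSTEP whose energy record contains NaN or Infinity. The
function names (first_bad_step, same_failure_step) are made up for this
example, and it assumes the mdout energy records look like the ones quoted
below. Print every step (ntpr=1) so the reported step is exact.

```python
import re

# Matches the step counter at the start of each mdout energy record.
STEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")
# Non-finite values as they appear in the quoted output.
BAD_RE = re.compile(r"\b(?:NaN|Infinity)\b")


def first_bad_step(mdout_text):
    """Return the NSTEP of the first record containing NaN/Infinity, else None."""
    current_step = None
    for line in mdout_text.splitlines():
        match = STEP_RE.search(line)
        if match:
            current_step = int(match.group(1))
        if current_step is not None and BAD_RE.search(line):
            return current_step
    return None


def same_failure_step(mdout_texts):
    """True only if every run fails, and all runs fail at the same NSTEP."""
    steps = [first_bad_step(text) for text in mdout_texts]
    return None not in steps and len(set(steps)) == 1


# Hypothetical usage, with two output files from identical fixed-seed runs:
#   texts = [open(p).read() for p in ("run1/heat3.out", "run2/heat3.out")]
#   print([first_bad_step(t) for t in texts], same_failure_step(texts))
```

If the two runs use the same fixed seed and the same input but blow up at
different steps, the hardware becomes the prime suspect.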

Also, your cards are running pretty hot. The default Nvidia fan settings
are garbage: they'll let your card get up to ~90C (where the controller
indicates it's in the red) while the fan speed is still only 50%. If you
google "coolbits 5" you should find information on how to use the NVIDIA X
Server Settings control panel to adjust your fan speed manually; crank it
to at least 75%.
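
To keep an eye on temperature and fan speed during a run, something like
the following Python sketch may help. It assumes your driver's nvidia-smi
supports the --query-gpu and --format=csv options (older drivers may not).
The function names and the 80 C warning threshold are arbitrary choices for
illustration, not Nvidia recommendations.

```python
import subprocess

# Assumes an nvidia-smi recent enough to support CSV queries.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,fan.speed",
    "--format=csv,noheader,nounits",
]


def parse_gpu_status(csv_text):
    """Parse 'index, temp, fan' CSV lines into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, temp, fan = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "temp_c": int(temp),
            # Some cards do not report fan speed; keep None in that case.
            "fan_pct": int(fan) if fan.isdigit() else None,
        })
    return rows


def hot_gpus(rows, limit_c=80):
    """Return the indices of GPUs at or above limit_c (threshold is arbitrary)."""
    return [row["index"] for row in rows if row["temp_c"] >= limit_c]


def read_gpu_status():
    """Run nvidia-smi and return the parsed status (needs an Nvidia driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_status(result.stdout)
```

Calling hot_gpus(read_gpu_status()) every minute or so while pmemd.cuda is
running would show whether the cards heat up just before the NaN appears.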

~Aron

On Fri, Oct 19, 2012 at 10:36 AM, <mhclewett.msn.com> wrote:

>
> Hello and thank you in advance for your help,
> I have a NaN error that does not seem to respond to the posted
> fixes/suggestions. I will provide as much information as would be helpful.
> My guess of what is helpful follows.
> I am operating Amber12 on a 2 GPU system that follows the Ross Walker
> recommendation for all hardware. The command nvidia-smi returns a
> temperature of 80 C for GPU0 and 85 C for GPU1.
> My bugfixes are current through bugfix.24.
> I am modeling a system after the TrpCage tutorial and have used the
> TrpCage tutorial as a starting point for input files.
> The error shown is about 30% of the way through heat3.out:
>  NSTEP =     2450   TIME(PS) =      11.225  TEMP(K) =   152.87  PRESS =     0.0
>  Etot   =     -4490.8226  EKtot   =      2014.7379  EPtot      =     -6505.5606
>  BOND   =       599.4379  ANGLE   =      1830.9888  DIHED      =      2996.7946
>  1-4 NB =      1019.0625  1-4 EEL =     21561.0655  VDWAALS    =     -2513.1648
>  EELEC  =    -23076.1258  EGB     =     -8923.6194  RESTRAINT  =         0.0000
> ------------------------------------------------------------------------------
>
>  NSTEP =     2500   TIME(PS) =      11.250  TEMP(K) =   149.42  PRESS =     0.0
>  Etot   =     -4487.8111  EKtot   =      1969.2665  EPtot      =     -6457.0776
>  BOND   =       598.1061  ANGLE   =      1847.8162  DIHED      =      2996.0754
>  1-4 NB =      1030.0616  1-4 EEL =     21568.1362  VDWAALS    =     -2496.5528
>  EELEC  =    -23090.1034  EGB     =     -8910.6168  RESTRAINT  =         0.0000
> ------------------------------------------------------------------------------
>
>  NSTEP =     2550   TIME(PS) =      11.275  TEMP(K) =      NaN  PRESS =     0.0
>  Etot   =            NaN  EKtot   =            NaN  EPtot      =    583581.2203
>  BOND   =         0.0000  ANGLE   =    643644.5002  DIHED      =         0.0000
>  1-4 NB =         0.0000  1-4 EEL =         0.0000  VDWAALS    =         0.0000
>  EELEC  =         0.0000  EGB     =    -60063.2798  RESTRAINT  =         0.0000
> ------------------------------------------------------------------------------
> (heat3.out, lines 558-580/1842, 30%)
>
> and heat3.in looks like this:
>
> Stage 1 heating of AB42 dimer 100 to 150K
>  &cntrl
>   imin=0, irest=1, ntx=5, nstlim=10000, dt=0.0005, ntc=2, ntf=2,
>   ntt=3, gamma_ln=5.0, tempi=100.0, temp0=150.0, ntpr=50, ntwx=50, ntb=0,
>   igb=5, ig=-1, cut=999., rgbmax=999.
>  /
>
> If I modify heat3.in to the following (all lines the same except...):
>   ntpr=1, ntwx=1, nscm=100,
> then I get (again, from a new heat3.out):
>  NSTEP =     2326   TIME(PS) =      11.163  TEMP(K) =   153.25  PRESS =     0.0
>  Etot   =     -4457.1334  EKtot   =      2019.7349  EPtot      =     -6476.8683
>  BOND   =       579.0866  ANGLE   =      1857.8567  DIHED      =      2994.7695
>  1-4 NB =      1034.6359  1-4 EEL =     21582.3349  VDWAALS    =     -2535.9145
>  EELEC  =    -23099.5555  EGB     =     -8890.0819  RESTRAINT  =         0.0000
> ------------------------------------------------------------------------------
>
>  NSTEP =     2327   TIME(PS) =      11.164  TEMP(K) =   153.24  PRESS =     0.0
>  Etot   =     -4456.8246  EKtot   =      2019.5847  EPtot      =     -6476.4093
>  BOND   =       579.5806  ANGLE   =      1859.4058  DIHED      =      2993.8184
>  1-4 NB =      1034.7352  1-4 EEL =     21581.9544  VDWAALS    =     -2535.9831
>  EELEC  =    -23099.1063  EGB     =     -8890.8143  RESTRAINT  =         0.0000
> ------------------------------------------------------------------------------
>
>  NSTEP =     2328   TIME(PS) =      11.164  TEMP(K) = Infinity  PRESS =     0.0
>  Etot   =       Infinity  EKtot   =       Infinity  EPtot      =     -6474.1288
>  BOND   =       580.8156  ANGLE   =      1861.9235  DIHED      =      2993.0305
>  1-4 NB =      1034.8214  1-4 EEL =     21581.6528  VDWAALS    =     -2536.1536
>  EELEC  =    -23098.8105  EGB     =     -8891.4086  RESTRAINT  =         0.0000
> ------------------------------------------------------------------------------
>
>  NSTEP =     2329   TIME(PS) =      11.165  TEMP(K) =      NaN  PRESS =     0.0
>  Etot   =            NaN  EKtot   =            NaN  EPtot      =     -5513.0260
>  BOND   =       473.0124  ANGLE   =      2960.9029  DIHED      =      3028.6947
>  1-4 NB =      1034.4870  1-4 EEL =     21500.3431  VDWAALS    =     -2534.0394
>  EELEC  =    -22853.5152  EGB     =     -9122.9116  RESTRAINT  =         0.0000
> ------------------------------------------------------------------------------
> Again, I am happy to provide as much additional info as would be helpful
> and am tremendously grateful for your advice and gift of time in responding.
> Heather Clewett
> Chemistry Graduate Student
> University of Nevada, Reno
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>



-- 
Aron Broom M.Sc
PhD Student
Department of Chemistry
University of Waterloo
Received on Fri Oct 19 2012 - 08:00:05 PDT