Re: [AMBER] NaN question

From: <mhclewett.msn.com>
Date: Sat, 20 Oct 2012 08:43:59 -0700

Hi Ross,
Thank you very much for the response. I am using NVIDIA GeForce 580 cards, which may be part of the issue--I'm still trying to figure out how to follow Aron's recommendations to check for memory problems.
I am running in serial on a single GPU and have continued to experience this challenge after updating.
Latest patch applied to AmberTools12: 26
Latest patch applied to Amber12: 12

I am running systems that are nearly identical to those I've run on our department's cluster using Amber9, and they've never had issues before. I am checking how they run on the CPU within our new GPU system right now, and so far I see a clean minimization step. I do not have phosphates.

The latter part of my original post has some output with ntpr=1 and ntwx=1; I don't see an obvious red flag at NSTEP = 2327 for something that seems unstable (of course, I may be overlooking something...)
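
For anyone who wants to automate the search for the first bad step, here is a minimal Python sketch. It assumes the mdout energy records use the "LABEL = value" layout shown in the quoted output below; the function name first_bad_step and the regular expression are illustrative, not part of Amber. On the ntpr=1 output quoted below it should point at NSTEP = 2328, where TEMP(K), Etot and EKtot are Infinity while the potential energy terms are still finite.

```python
import math
import re

# "LABEL = value" pairs, e.g. "TEMP(K) = 153.24" or "1-4 NB = 1034.7352".
# The label list is an assumption based on the output quoted in this thread.
FIELD = re.compile(r"(1-4 NB|1-4 EEL|[A-Za-z][A-Za-z()]*)\s*=\s*(\S+)")


def to_float(token):
    """Convert a printed value to float; unparsable tokens (e.g. ******) become NaN."""
    try:
        return float(token)  # float() accepts "NaN" and "Infinity"
    except ValueError:
        return math.nan


def first_bad_step(text):
    """Return (nstep, bad_labels) for the first record with a non-finite value, else None."""
    # Start a new record at every "NSTEP =".
    for record in re.split(r"(?=NSTEP\s*=)", text):
        # Ignore anything after the dashed separator that closes the record.
        record = record.split("-----")[0]
        fields = dict(FIELD.findall(record))
        if "NSTEP" not in fields:
            continue
        bad = [label for label, token in fields.items()
               if label != "NSTEP" and not math.isfinite(to_float(token))]
        if bad:
            return int(fields["NSTEP"]), bad
    return None
```

Usage would be something like first_bad_step(open("heat3.out").read()). The frames written with ntwx=1 just before the reported step are then the ones to inspect.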
Again, your help is most gratefully appreciated.
Heather


> Date: Fri, 19 Oct 2012 11:19:04 -0700
> From: ross.rosswalker.co.uk
> To: amber.ambermd.org
> Subject: Re: [AMBER] NaN question
>
> Hi Heather,
>
> If you are running this on 2 GPUs at once, i.e. mpirun -np 2, then this is
> almost certainly due to the race condition introduced with the SPFP patch
> in bugfix.9. Bugfix.12 will fix this.
>
> If this happens in serial on a single GPU then it will need more
> investigation. Does a similar thing happen if you run on CPUs? Does your
> system have phosphates present by any chance? Collapse of hydroxyl protons,
> which have no VDW radii, onto highly charged atoms (often phosphate groups)
> can cause NaNs in simulations, and they tend to appear randomly. The fix is
> to add a very small VDW radius to the hydroxyl protons in the force field.
> This is speculation on what is happening though. Ideally you'll need to run
> with ntpr=1, ntwx=1 so you can watch what actually happens immediately
> before the NaN occurs.
>
>
> All the best
> Ross
>
> On 10/19/12 7:36 AM, "mhclewett.msn.com" <mhclewett.msn.com> wrote:
>
> >
> >Hello and thank you in advance for your help,
> >I have a NaN error that does not seem to respond to the posted
> >fixes/suggestions. I will provide as much information as would be
> >helpful. My guess of what is helpful follows.
> >I am operating Amber12 on a 2 GPU system that follows the Ross Walker
> >recommendation for all hardware. The command nvidia-smi returns a
> >temperature of 80 C for GPU0 and 85 C for GPU1.
> >My bugfixes are current through bugfix.24.
> >I am modeling a system after the TrpCage tutorial and have used the
> >TrpCage tutorial as a starting point for input files.
> >The error shown is about 30% of the way through heat3.out:
> > NSTEP = 2450 TIME(PS) = 11.225 TEMP(K) = 152.87 PRESS = 0.0
> > Etot = -4490.8226 EKtot = 2014.7379 EPtot = -6505.5606
> > BOND = 599.4379 ANGLE = 1830.9888 DIHED = 2996.7946
> > 1-4 NB = 1019.0625 1-4 EEL = 21561.0655 VDWAALS = -2513.1648
> > EELEC = -23076.1258 EGB = -8923.6194 RESTRAINT = 0.0000
> > ------------------------------------------------------------------------------
> >
> > NSTEP = 2500 TIME(PS) = 11.250 TEMP(K) = 149.42 PRESS = 0.0
> > Etot = -4487.8111 EKtot = 1969.2665 EPtot = -6457.0776
> > BOND = 598.1061 ANGLE = 1847.8162 DIHED = 2996.0754
> > 1-4 NB = 1030.0616 1-4 EEL = 21568.1362 VDWAALS = -2496.5528
> > EELEC = -23090.1034 EGB = -8910.6168 RESTRAINT = 0.0000
> > ------------------------------------------------------------------------------
> >
> > NSTEP = 2550 TIME(PS) = 11.275 TEMP(K) = NaN PRESS = 0.0
> > Etot = NaN EKtot = NaN EPtot = 583581.2203
> > BOND = 0.0000 ANGLE = 643644.5002 DIHED = 0.0000
> > 1-4 NB = 0.0000 1-4 EEL = 0.0000 VDWAALS = 0.0000
> > EELEC = 0.0000 EGB = -60063.2798 RESTRAINT = 0.0000
> > ------------------------------------------------------------------------------
> >heat3.out lines 558-580/1842 30%
> >and heat3.in looks like this:
> >Stage 1 heating of AB42 dimer 100 to 150K
> > &cntrl
> >  imin=0, irest=1, ntx=5, nstlim=10000, dt=0.0005, ntc=2, ntf=2,
> >  ntt=3, gamma_ln=5.0, tempi=100.0, temp0=150.0, ntpr=50, ntwx=50,
> >  ntb=0, igb=5, ig=-1, cut=999., rgbmax=999.
> > /
> >
> >If I modify heat3.in to the following (all lines the same except...):
> >ntpr=1, ntwx=1, nscm=100,
> >then I get (again, from a new heat3.out):
> > NSTEP = 2326 TIME(PS) = 11.163 TEMP(K) = 153.25 PRESS = 0.0
> > Etot = -4457.1334 EKtot = 2019.7349 EPtot = -6476.8683
> > BOND = 579.0866 ANGLE = 1857.8567 DIHED = 2994.7695
> > 1-4 NB = 1034.6359 1-4 EEL = 21582.3349 VDWAALS = -2535.9145
> > EELEC = -23099.5555 EGB = -8890.0819 RESTRAINT = 0.0000
> > ------------------------------------------------------------------------------
> >
> > NSTEP = 2327 TIME(PS) = 11.164 TEMP(K) = 153.24 PRESS = 0.0
> > Etot = -4456.8246 EKtot = 2019.5847 EPtot = -6476.4093
> > BOND = 579.5806 ANGLE = 1859.4058 DIHED = 2993.8184
> > 1-4 NB = 1034.7352 1-4 EEL = 21581.9544 VDWAALS = -2535.9831
> > EELEC = -23099.1063 EGB = -8890.8143 RESTRAINT = 0.0000
> > ------------------------------------------------------------------------------
> >
> > NSTEP = 2328 TIME(PS) = 11.164 TEMP(K) = Infinity PRESS = 0.0
> > Etot = Infinity EKtot = Infinity EPtot = -6474.1288
> > BOND = 580.8156 ANGLE = 1861.9235 DIHED = 2993.0305
> > 1-4 NB = 1034.8214 1-4 EEL = 21581.6528 VDWAALS = -2536.1536
> > EELEC = -23098.8105 EGB = -8891.4086 RESTRAINT = 0.0000
> > ------------------------------------------------------------------------------
> >
> > NSTEP = 2329 TIME(PS) = 11.165 TEMP(K) = NaN PRESS = 0.0
> > Etot = NaN EKtot = NaN EPtot = -5513.0260
> > BOND = 473.0124 ANGLE = 2960.9029 DIHED = 3028.6947
> > 1-4 NB = 1034.4870 1-4 EEL = 21500.3431 VDWAALS = -2534.0394
> > EELEC = -22853.5152 EGB = -9122.9116 RESTRAINT = 0.0000
> > ------------------------------------------------------------------------------
> >Again, I am happy to provide as much additional info as would be helpful
> >and am tremendously grateful for your advice and gift of time in
> >responding.
> >Heather Clewett
> >Chemistry Graduate Student
> >University of Nevada, Reno
> >
> >_______________________________________________
> >AMBER mailing list
> >AMBER.ambermd.org
> >http://lists.ambermd.org/mailman/listinfo/amber
>
>
>
                                               
Received on Sat Oct 20 2012 - 09:00:03 PDT