Re: [AMBER] CUDA NaN error occuring at "wrapping first mol"

From: Scott Le Grand <varelse2005.gmail.com>
Date: Thu, 1 Dec 2011 16:32:09 -0800

A fix is being tested right now... I know what's happening...



On Thu, Dec 1, 2011 at 4:20 PM, Bill Sinko <wsinko.ucsd.edu> wrote:

> Hi Ross,
>
> Thanks for the tips on the input file; I've changed these (attached).
>
> I also recompiled amber11 exactly as you specified from the
> Amber11.tar.bz2. There were a few errors in the bugfix of amber11, so
> I attached the file bugfix_log.txt and configure.ref from the
> $AMBERHOME/Ambertools/src/ directory. Nonetheless, the test files look
> much better now, with only a minor difference in 2 tests (.diff attached).
>
> Despite the clean compile, I still get this NaN error occurring at the
> first time "wrapping first mol." occurs (turning iwrap off didn't
> help). Please let me know if you think the bugfix patch errors are
> the problem and how to fix these, or if you think it could be
> something else. Also, just to double check: you did run this out on
> your machine until the first time "wrapping first mol." occurs,
> correct? My simulation has always run fine until the "wrapping first
> mol." line occurs.
>
> Thanks for your help,
>
> Bill
>
> On Wed, Nov 30, 2011 at 8:01 PM, Ross Walker <ross.rosswalker.co.uk>
> wrote:
> > Hi Bill,
> >
> > Based on the output files you sent me you have the following header:
> >
> > |--------------------- INFORMATION ----------------------
> > | GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.
> > |
> > | Implementation by:
> > | Ross C. Walker (SDSC)
> > | Scott Le Grand (nVIDIA)
> > | Duncan Poole (nVIDIA)
> > |
> > | CAUTION: The CUDA code is currently experimental.
> > | You use it at your own risk. Be sure to
> > | check ALL results carefully.
> > |
> > | Precision model in use:
> > | [SPDP] - Hybrid Single/Double Precision (Default).
> > |
> > |--------------------------------------------------------
> >
> > This suggests you are using some ancient and/or modified version of the
> > PMEMD GPU code. This would explain why your test cases are so messed up as
> > well and is almost certainly the cause of all your problems. This header
> > should read:
> >
> > |--------------------- INFORMATION ----------------------
> > | GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.
> > | Version 2.2
> > |
> > | 08/16/2011
> > |
> > |
> > | Implementation by:
> > | Ross C. Walker (SDSC)
> > | Scott Le Grand (nVIDIA)
> > | Duncan Poole (nVIDIA)
> > |
> > | CAUTION: The CUDA code is currently experimental.
> > | You use it at your own risk. Be sure to
> > | check ALL results carefully.
> > |
> > | Precision model in use:
> > | [SPDP] - Hybrid Single/Double Precision (Default).
> > |
> > |--------------------------------------------------------
> >
> >
> > Note the "Version 2.2" line.
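
A minimal sketch of how the banner comparison above could be automated, assuming the mdout banner looks like the two headers quoted in this thread. The function name `pmemd_cuda_version` is hypothetical and is not part of AMBER.

```python
import re

def pmemd_cuda_version(mdout_text):
    """Return the version string from the GPU banner, or None if it is absent."""
    in_banner = False
    for line in mdout_text.splitlines():
        if "GPU (CUDA) Version of PMEMD in use" in line:
            # The banner body starts on the line after this one.
            in_banner = True
            continue
        if in_banner:
            match = re.search(r"\bVersion\s+(\d+(?:\.\d+)*)", line)
            if match:
                return match.group(1)
            if line.lstrip().startswith("|---"):
                # Closing rule of the banner reached without a version line.
                return None
    return None
```

A return value of None for an mdout produced by pmemd.cuda would point to the kind of old or unpatched build described above.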
> >
> > I tried running with a fully patched version of AMBER 11 and I can't
> > reproduce your problem. Also, a few comments on your input file:
> >
> > &cntrl
> >
> > timlim = 999999., nmropt = 0, imin = 0,
> > ntx = 5, irest = 1, ntrx = 1, ntxo = 1,
> > ntpr = 5000, ntwx = 5000, ntwv = 0, ntwe = 0,
> > ioutfm = 1, ntwr = 5000,
> >
> > ntf = 2, ntb = 1,
> > igb = 0,
> > cut = 9, nsnb = 20,
> >
> > nstlim = 25000000, nscm = 2500, iwrap = 1,
> > t = 0.0, dt = 0.002,
> >
> > temp0 = 300.0, tempi = 200.0, tautp=0.5, ig = -1,
> > heat = 0.0, ntt = 1,
> >
> > ntc = 2, tol = 0.00001, jfastw = 0,
> >
> > ibelly=0, ntr=0,
> >
> > &end
> >
> >
> > timlim - Not sure this even does anything anymore.
> > ntwr=5000 - this is probably too frequent for writing a restart - set it
> > to around 50000 or more for better performance.
> > nsnb = 20 - you should NOT set this anymore (since AMBER 7, in fact). Leave
> > it at the default and let pmemd automatically rebuild the nonbonded list
> > when needed.
> >
> > tempi= 200.0 - this is a restart (irest=1) so this is ignored
> > tautp=0.5 - this is a very tight coupling for the Berendsen thermostat. I
> > prefer something along the lines of 10.0 for production runs.
> > ig=-1 - This is good to set, but in your case (irest=1, ntt=1) it has
> > absolutely no effect on the results since random numbers are not used here.
> > jfastw=0 - Don't mess with this, just remove it from your mdin file.
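
The comments above can be turned into a rough check on an mdin file. This is only a sketch: `parse_cntrl` and `review_cntrl` are hypothetical helper names, the parsing is a simple regular expression rather than a real Fortran namelist reader, and the thresholds (ntwr around 50000, tautp around 10.0) are just the values suggested in this reply.

```python
import re

def parse_cntrl(mdin_text):
    """Collect name = value pairs from an mdin &cntrl namelist as strings."""
    pairs = re.findall(r"([A-Za-z_]\w*)\s*=\s*([^,\s]+)", mdin_text)
    return {name.lower(): value for name, value in pairs}

def review_cntrl(settings):
    """Return (name, note) pairs for the settings questioned in the reply above."""
    notes = []
    restart = int(float(settings.get("irest", "0"))) == 1
    berendsen = int(float(settings.get("ntt", "0"))) == 1
    if "timlim" in settings:
        notes.append(("timlim", "may no longer do anything"))
    if "ntwr" in settings and float(settings["ntwr"]) < 50000:
        notes.append(("ntwr", "restart written often; 50000 or more suggested"))
    if "nsnb" in settings:
        notes.append(("nsnb", "leave at the default; pmemd rebuilds the list itself"))
    if restart and "tempi" in settings:
        notes.append(("tempi", "ignored on a restart (irest=1)"))
    if berendsen and "tautp" in settings and float(settings["tautp"]) < 10.0:
        notes.append(("tautp", "tight coupling; around 10.0 suggested for production"))
    if restart and berendsen and "ig" in settings:
        notes.append(("ig", "no effect here; no random numbers are used"))
    if "jfastw" in settings:
        notes.append(("jfastw", "remove from the mdin file"))
    return notes

example = "&cntrl\n irest = 1, ntt = 1, ntwr = 5000, nsnb = 20, tautp=0.5,\n&end"
for name, note in review_cntrl(parse_cntrl(example)):
    print(f"{name}: {note}")
```

None of these settings is claimed to cause the NaN; the check only mirrors the housekeeping advice given above.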
> >
> > I suggest completely deleting your amber installation and starting again
> > from scratch with:
> >
> > tar xvjf Amber11.tar.bz2
> > tar xvjf AmberTools-1.5.tar.bz2
> > cd amber11
> > export AMBERHOME=`pwd`
> > wget http://ambermd.org/bugfixes/AmberTools/1.5/bugfix.all
> > patch -p0 < bugfix.all
> > wget http://ambermd.org/bugfixes/11.0/bugfix.all.tar.bz2
> > wget http://ambermd.org/bugfixes/11.0/apply_bugfix.x
> >
> > chmod 700 apply_bugfix.x
> > ./apply_bugfix.x $AMBERHOME/bugfix.all.tar.bz2
> >
> > Then rebuild everything from scratch and you should be good.
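
Since a partly applied patch is the suspected cause, it may help to scan the saved patch output before rebuilding. A minimal sketch, assuming the log contains standard GNU patch messages; `failed_patch_lines` is a hypothetical helper, and the file paths in the sample log are made up for illustration.

```python
import re

# Message fragments printed by GNU patch when a hunk or file cannot be patched.
PATCH_PROBLEM = re.compile(
    r"FAILED"
    r"|saving rejects to file"
    r"|can't find file to patch"
    r"|Reversed \(or previously applied\) patch detected"
)

def failed_patch_lines(log_text):
    """Return (line_number, line) for every log line that reports a problem."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(log_text.splitlines(), start=1)
        if PATCH_PROBLEM.search(line)
    ]

# Hypothetical log excerpt; the paths are illustrative only.
sample_log = """patching file src/example_a.cpp
Hunk #1 FAILED at 120.
1 out of 3 hunks FAILED -- saving rejects to file src/example_a.cpp.rej
patching file src/example_b.fpp
"""
for number, line in failed_patch_lines(sample_log):
    print(number, line)
```

An empty result does not prove the tree is correct, but any hit is a good reason to start again from a fresh extraction.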
> >
> > All the best
> > Ross
> >
> >> -----Original Message-----
> >> From: Ross Walker [mailto:ross.rosswalker.co.uk]
> >> Sent: Wednesday, November 30, 2011 5:26 PM
> >> To: 'AMBER Mailing List'
> >> Subject: Re: [AMBER] CUDA NaN error occuring at "wrapping first mol"
> >>
> >> Hi Bill,
> >>
> >> Can you please send me all of the files I need to reproduce this on my
> >> own machine, i.e. the prmtop, inpcrd and mdin files?
> >>
> >> Looking at your test diff file is VERY concerning though. It looks like
> >> the patch didn't apply properly since you seem to have tons of weird
> >> differences in the test cases.
> >>
> >> All the best
> >> Ross
> >>
> >> > -----Original Message-----
> >> > From: Bill Sinko [mailto:wsinko.ucsd.edu]
> >> > Sent: Wednesday, November 30, 2011 3:55 PM
> >> > To: amber.ambermd.org
> >> > Subject: [AMBER] CUDA NaN error occuring at "wrapping first mol"
> >> >
> >> > I have noticed an error occurring in pmemd.cuda at the first instance
> >> > of the words "wrapping first mol." in the mdout file for larger systems
> >> > (~60,000 atoms). After this error the restart and coordinate files
> >> > are filled with NaN. This same system has been run out to 160ns with
> >> > no error using pmemd.mpi. I have run smaller systems (10,000 to 15,000
> >> > atoms) out to 1 microsecond with the same pmemd.cuda executable on the
> >> > same computer with no error. "wrapping first mol" is output numerous
> >> > times in the smaller systems but does not cause problems.
> >> >
> >> > I am running pmemd.cuda amber11 with the latest bugfixes up to 19,
> >> > cudatoolkit 4.0.17, on 2 quad core Intel(R) Xeon(R) CPU X5472 @
> >> > 3.00GHz. This error was seen using a GTX570, a GTX580 (3 GB memory),
> >> > and a Tesla C2050 (running on a separate computer with 2 quad core
> >> > Intel(R) Xeon(R) CPU W5580 @ 3.20GHz).
> >> >
> >> > The error is reproducible in that it occurs at the exact same time
> >> > given the same ig value and occurs as soon as the "wrapping first
> >> > mol." occurs given a random ig value. Turning iwrap off does not fix
> >> > the problem and it will occur at the same time point as the "wrapping
> >> > first mol" occurred with iwrap on.
> >> >
> >> > Below is the input file, and first instance of error in the output
> >> > file. I also attached the test log file from when I compiled. Any
> >> > help is much appreciated.
> >> >
> >> >
> >> > Thanks,
> >> >
> >> > Bill
> >> >
> >> >
> >> > &cntrl
> >> >
> >> > timlim = 999999., nmropt = 0, imin = 0,
> >> > ntx = 5, irest = 1, ntrx = 1, ntxo = 1,
> >> > ntpr = 5000, ntwx = 5000, ntwv = 0, ntwe = 0,
> >> > ioutfm = 1, ntwr = 5000,
> >> >
> >> > ntf = 2, ntb = 1,
> >> > igb = 0,
> >> > cut = 9, nsnb = 20,
> >> >
> >> > nstlim = 25000000, nscm = 2500, iwrap = 1,
> >> > t = 0.0, dt = 0.002,
> >> >
> >> > temp0 = 300.0, tempi = 200.0, tautp=0.5, ig = -1,
> >> > heat = 0.0, ntt = 1,
> >> >
> >> > ntc = 2, tol = 0.00001, jfastw = 0,
> >> >
> >> > ibelly=0, ntr=0,
> >> >
> >> > &end
> >> >
> >> >
> >> > Everything is fine up to this point, and the words "wrapping first
> >> > mol." have not occurred yet. Here is the mdout where the error starts.
> >> >
> >> >
> >> >  NSTEP =   345000   TIME(PS) =    6470.000  TEMP(K) =   300.63  PRESS =     0.0
> >> >  Etot   =   -142135.0768  EKtot   =     36637.2070  EPtot      =   -178772.2838
> >> >  BOND   =      1471.6496  ANGLE   =      3908.0882  DIHED      =      5198.8718
> >> >  1-4 NB =      1769.8981  1-4 EEL =     17988.7091  VDWAALS    =     19870.0613
> >> >  EELEC  =   -228979.5620  EHBOND  =         0.0000  RESTRAINT  =         0.0000
> >> >  ------------------------------------------------------------------------------
> >> >
> >> > check COM velocity, temp: 0.000021 0.00(Removed)
> >> > check COM velocity, temp: NaN NaN(Removed)
> >> > wrapping first mol.: NaN NaN NaN
> >> > wrapping first mol.: NaN NaN NaN
> >> >
> >> >  NSTEP =   350000   TIME(PS) =    6480.000  TEMP(K) =      NaN  PRESS =     0.0
> >> >  Etot   =            NaN  EKtot   =            NaN  EPtot      =   -124095.8889
> >> >  BOND   =         0.0000  ANGLE   =    955000.0368  DIHED      =         0.0000
> >> >  1-4 NB =         0.0000  1-4 EEL =         0.0000  VDWAALS    =     -1374.3325
> >> >  EELEC  =  -1077721.5932  EHBOND  =         0.0000  RESTRAINT  =         0.0000
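
To pin down where a run like the one above first goes bad, the mdout can be scanned for the first NaN and the last clean NSTEP record before it. A minimal sketch; `first_nan` is a hypothetical helper, and it assumes NaN is printed literally as "NaN", as in the excerpt above.

```python
import re

NSTEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")
NAN_RE = re.compile(r"\bNaN\b")

def first_nan(mdout_text):
    """Return (last_clean_nstep, line_number, line) for the first NaN, or None."""
    last_clean_nstep = None
    for number, line in enumerate(mdout_text.splitlines(), start=1):
        if NAN_RE.search(line):
            return last_clean_nstep, number, line.strip()
        step = NSTEP_RE.search(line)
        if step:
            # Only reached for NSTEP lines that contain no NaN themselves.
            last_clean_nstep = int(step.group(1))
    return None

excerpt = """ NSTEP = 345000 TIME(PS) = 6470.000 TEMP(K) = 300.63 PRESS = 0.0
check COM velocity, temp: 0.000021 0.00(Removed)
check COM velocity, temp: NaN NaN(Removed)
wrapping first mol.: NaN NaN NaN
"""
print(first_nan(excerpt))
```

On the excerpt above the first NaN is on the COM velocity check, one line before "wrapping first mol.", with NSTEP 345000 as the last clean energy record. The wrapping message may therefore be the first visible symptom rather than the origin of the NaN.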
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > William Sinko
> >> >
> >> > Biomedical Sciences Graduate Student
> >> >
> >> > Professor J. Andrew McCammon Group
> >> >
> >> > Howard Hughes Medical Institute
> >> >
> >> > University of California, San Diego
> >> > 9500 Gilman Drive
> >> > La Jolla, California 92093
> >>
> >>
> >> _______________________________________________
> >> AMBER mailing list
> >> AMBER.ambermd.org
> >> http://lists.ambermd.org/mailman/listinfo/amber
> >
> >
>
>
>
>
>
>
Received on Thu Dec 01 2011 - 17:00:02 PST