Re: [AMBER] CUDA NaN error occuring at "wrapping first mol"

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 30 Nov 2011 20:01:28 -0800

Hi Bill,

Based on the output files you sent me you have the following header:

|--------------------- INFORMATION ----------------------
| GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.
|
| Implementation by:
| Ross C. Walker (SDSC)
| Scott Le Grand (nVIDIA)
| Duncan Poole (nVIDIA)
|
| CAUTION: The CUDA code is currently experimental.
| You use it at your own risk. Be sure to
| check ALL results carefully.
|
| Precision model in use:
| [SPDP] - Hybrid Single/Double Precision (Default).
|
|--------------------------------------------------------

This suggests you are using some ancient and/or modified version of the
PMEMD GPU code. This would explain why your test cases are so messed up as
well and is almost certainly the cause of all your problems. This header
should read:

|--------------------- INFORMATION ----------------------
| GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.
| Version 2.2
|
| 08/16/2011
|
|
| Implementation by:
| Ross C. Walker (SDSC)
| Scott Le Grand (nVIDIA)
| Duncan Poole (nVIDIA)
|
| CAUTION: The CUDA code is currently experimental.
| You use it at your own risk. Be sure to
| check ALL results carefully.
|
| Precision model in use:
| [SPDP] - Hybrid Single/Double Precision (Default).
|
|--------------------------------------------------------


Note the "Version 2.2" line.
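
A quick way to check this on any mdout file is to look for a numbered
"Version" line among the banner lines. The following is only a minimal
sketch: the function name check_cuda_banner and the default file name mdout
are illustrative, and it assumes the banner lines start with "|" as in the
headers shown above.

```shell
# Sketch: report whether the pmemd.cuda banner in an mdout file carries a
# numbered "Version" line. check_cuda_banner is a hypothetical helper name.
check_cuda_banner() {
    mdout="${1:-mdout}"
    # Match banner lines such as "| Version 2.2"; the line
    # "| GPU (CUDA) Version of PMEMD in use" has no digit after "Version"
    # and is therefore not matched.
    if version_line=$(grep -E '^\|[[:space:]]*Version[[:space:]]+[0-9]' "$mdout"); then
        echo "banner reports: $(echo "$version_line" | head -n 1 | sed 's/^|[[:space:]]*//')"
    else
        echo "no Version line in banner: build is probably unpatched"
        return 1
    fi
}
```

Calling check_cuda_banner with the path to an mdout file returns non-zero
when no such line is found, so it can be used as a guard in a job script.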

I tried running with a fully patched version of AMBER 11 and I can't
reproduce your problem. Also, a few comments on your input file:

&cntrl

  timlim = 999999., nmropt = 0, imin = 0,
  ntx = 5, irest = 1, ntrx = 1, ntxo = 1,
  ntpr = 5000, ntwx = 5000, ntwv = 0, ntwe = 0,
  ioutfm = 1, ntwr = 5000,

  ntf = 2, ntb = 1,
  igb = 0,
  cut = 9, nsnb = 20,

  nstlim = 25000000, nscm = 2500, iwrap = 1,
  t = 0.0, dt = 0.002,

  temp0 = 300.0, tempi = 200.0, tautp=0.5, ig = -1,
  heat = 0.0, ntt = 1,

  ntc = 2, tol = 0.00001, jfastw = 0,

  ibelly=0, ntr=0,

&end


timlim - Not sure this even does anything anymore.
ntwr=5000 - this is probably too frequent for writing a restart - set it
to around 50000 or more for better performance.
nsnb = 20 - you should NOT set this anymore (since AMBER 7 in fact). Leave it
at the default and let pmemd automatically rebuild the nonbonded list when
needed.

tempi= 200.0 - this is a restart (irest=1) so this is ignored
tautp=0.5 - this is a very tight coupling for the Berendsen thermostat. I
prefer something along the lines of 10.0 for production runs.
ig=-1 - This is good to set but in your case, with irest=1 and ntt=1, it has
absolutely no effect on the results since random numbers are not used here.
jfastw=0 - Don't mess with this, just remove it from your mdin file.
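
The settings I suggest dropping (timlim, nsnb, jfastw) are easy to miss in
an old mdin file that has been copied around for years. As a rough sketch,
something like the following would flag them; the function name
flag_obsolete_mdin and the default file name mdin are made up for
illustration, and the list of keys is just the three mentioned above.

```shell
# Sketch: flag mdin settings that the comments above suggest removing.
# flag_obsolete_mdin is a hypothetical helper, not part of AMBER.
flag_obsolete_mdin() {
    mdin="${1:-mdin}"
    found=0
    for key in timlim nsnb jfastw; do
        # Namelist keys are case-insensitive and may follow whitespace or a comma.
        if grep -Eiq "(^|[[:space:],])${key}[[:space:]]*=" "$mdin"; then
            echo "$key is set: consider removing it from $mdin"
            found=1
        fi
    done
    # Non-zero return means at least one flagged key was present.
    return $found
}
```

It only reports what it finds and does not modify the file.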

I suggest completely deleting your amber installation and starting again from
scratch with:

tar xvjf Amber11.tar.bz2
tar xvjf AmberTools-1.5.tar.bz2
cd amber11
export AMBERHOME=`pwd`
wget http://ambermd.org/bugfixes/AmberTools/1.5/bugfix.all
patch -p0 < bugfix.all
wget http://ambermd.org/bugfixes/11.0/bugfix.all.tar.bz2
wget http://ambermd.org/bugfixes/11.0/apply_bugfix.x

chmod 700 apply_bugfix.x
./apply_bugfix.x $AMBERHOME/bugfix.all.tar.bz2
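
Since a badly applied patch is the most likely culprit here, it may be worth
checking for rejected hunks before rebuilding. The patch program leaves a
.rej file next to any source file where a hunk failed to apply. The sketch
below assumes the bugfix script uses patch underneath; the function name
find_rejected_hunks is hypothetical.

```shell
# Sketch: list leftover .rej files, which patch writes when a hunk fails.
# find_rejected_hunks is a hypothetical helper name.
find_rejected_hunks() {
    # Default to $AMBERHOME, falling back to the current directory.
    root="${1:-${AMBERHOME:-.}}"
    rejects=$(find "$root" -type f -name '*.rej')
    if [ -n "$rejects" ]; then
        echo "$rejects"
        return 1
    fi
    echo "no .rej files under $root"
}
```

A non-zero return means at least one hunk was rejected and the tree should
not be trusted.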

Then rebuild everything from scratch and you should be good.

All the best
Ross


> -----Original Message-----
> From: Ross Walker [mailto:ross.rosswalker.co.uk]
> Sent: Wednesday, November 30, 2011 5:26 PM
> To: 'AMBER Mailing List'
> Subject: Re: [AMBER] CUDA NaN error occuring at "wrapping first mol"
>
> Hi Bill,
>
> Can you please send me all of the files I need to reproduce this on my
> own machine, i.e. the prmtop, inpcrd and mdin files.
>
> Looking at your test diff file is VERY concerning though. It looks like
> the patch didn't apply properly since you seem to have tons of weird
> differences in the test cases.
>
> All the best
> Ross
>
> > -----Original Message-----
> > From: Bill Sinko [mailto:wsinko.ucsd.edu]
> > Sent: Wednesday, November 30, 2011 3:55 PM
> > To: amber.ambermd.org
> > Subject: [AMBER] CUDA NaN error occuring at "wrapping first mol"
> >
> > I have noticed an error occurring in pmemd.cuda at the first instance
> > of the words "wrapping first mol." in the mdout file in larger systems
> > (~60,000 atoms). After this error the restart and coordinate files
> > are filled with NaN. This same system has been run out to 160ns with
> > no error using pmemd.mpi. I have run smaller systems, 10,000 to 15,000
> > atoms, out to 1 microsecond with the same pmemd.cuda executable on the
> > same computer with no error. "wrapping first mol" is output numerous
> > times in the smaller system but does not cause problems.
> >
> > I am running pmemd.cuda amber11 with the latest bugfixes up to 19,
> > cudatoolkit 4.0.17, I have 2 quad core Intel(R) Xeon(R) CPU X5472 @
> > 3.00GHz. This error was seen using a GTX570, a GTX580 (3gb memory),
> > and a Tesla C2050 (running on a separate computer with 2 quad core
> > Intel(R) Xeon(R) CPU W5580 @ 3.20GHz)
> >
> > The error is reproducible in that it occurs at the exact same time
> > given the same ig value and occurs as soon as the "wrapping first
> > mol." occurs given a random ig value. Turning iwrap off does not fix
> > the problem and it will occur at the same time point as the "wrapping
> > first mol" occurred with iwrap on.
> >
> > Below is the input file, and first instance of error in the output
> > file. I also attached the test log file from when I compiled. Any
> > help is much appreciated.
> >
> >
> > Thanks,
> >
> > Bill
> >
> >
> > &cntrl
> >
> > timlim = 999999., nmropt = 0, imin = 0,
> > ntx = 5, irest = 1, ntrx = 1, ntxo = 1,
> > ntpr = 5000, ntwx = 5000, ntwv = 0, ntwe = 0,
> > ioutfm = 1, ntwr = 5000,
> >
> > ntf = 2, ntb = 1,
> > igb = 0,
> > cut = 9, nsnb = 20,
> >
> > nstlim = 25000000, nscm = 2500, iwrap = 1,
> > t = 0.0, dt = 0.002,
> >
> > temp0 = 300.0, tempi = 200.0, tautp=0.5, ig = -1,
> > heat = 0.0, ntt = 1,
> >
> > ntc = 2, tol = 0.00001, jfastw = 0,
> >
> > ibelly=0, ntr=0,
> >
> > &end
> >
> >
> > Everything is fine until here and the words "wrapping first mol." have
> > not occurred yet. Here is the mdout when the error starts.
> >
> >
> > NSTEP = 345000 TIME(PS) = 6470.000 TEMP(K) = 300.63 PRESS = 0.0
> > Etot = -142135.0768 EKtot = 36637.2070 EPtot = -178772.2838
> > BOND = 1471.6496 ANGLE = 3908.0882 DIHED = 5198.8718
> > 1-4 NB = 1769.8981 1-4 EEL = 17988.7091 VDWAALS = 19870.0613
> > EELEC = -228979.5620 EHBOND = 0.0000 RESTRAINT = 0.0000
> > ------------------------------------------------------------------------------
> >
> > check COM velocity, temp: 0.000021 0.00(Removed)
> > check COM velocity, temp: NaN NaN(Removed)
> > wrapping first mol.: NaN NaN NaN
> > wrapping first mol.: NaN NaN NaN
> >
> > NSTEP = 350000 TIME(PS) = 6480.000 TEMP(K) = NaN PRESS = 0.0
> > Etot = NaN EKtot = NaN EPtot = -124095.8889
> > BOND = 0.0000 ANGLE = 955000.0368 DIHED = 0.0000
> > 1-4 NB = 0.0000 1-4 EEL = 0.0000 VDWAALS = -1374.3325
> > EELEC = -1077721.5932 EHBOND = 0.0000 RESTRAINT = 0.0000
> >
> >
> >
> >
> > --
> > William Sinko
> >
> > Biomedical Sciences Graduate Student
> >
> > Professor J. Andrew McCammon Group
> >
> > Howard Hughes Medical Institute
> >
> > University of California, San Diego
> > 9500 Gilman Drive
> > La Jolla, California 92093
>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Nov 30 2011 - 20:30:02 PST