Re: [AMBER] CUDA NaN error occuring at "wrapping first mol"

From: Bill Sinko <wsinko.ucsd.edu>
Date: Thu, 1 Dec 2011 16:20:30 -0800

Hi Ross,

Thanks for the tips on the input file; I've made those changes (attached).

I also recompiled amber11 from Amber11.tar.bz2 exactly as you
specified. The amber11 bugfix step reported a few errors, so I have
attached bugfix_log.txt and configure.ref from the
$AMBERHOME/Ambertools/src/ directory. Nonetheless, the test results
look much better now: only 2 tests show minor differences (.diff
attached).
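
A quick sanity check after a rebuild is to confirm that the mdout
banner carries the "Version 2.2" line Ross describes in the quoted
message below. A minimal sketch in Python, assuming the banner layout
shown in that message; the function name cuda_header_version is
illustrative and not part of AMBER:

```python
import re


def cuda_header_version(mdout_text):
    """Return the version string from the pmemd.cuda banner, or None.

    Assumes the banner layout quoted in this thread: a line containing
    "GPU (CUDA) Version of PMEMD in use", optionally followed by a
    "| Version X.Y" line before the "Implementation by" author list.
    """
    in_banner = False
    for raw_line in mdout_text.splitlines():
        line = raw_line.strip()
        if "GPU (CUDA) Version of PMEMD in use" in line:
            in_banner = True
            continue
        if in_banner:
            match = re.match(r"\|\s+Version\s+(\S+)", line)
            if match:
                return match.group(1)
            if "Implementation by" in line:
                # Reached the author list without seeing a version line.
                return None
    return None
```

Calling it on the contents of an mdout file (for example
cuda_header_version(open("mdout").read())) returns None for a banner
without a version line, such as the one from my earlier output.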

Despite the clean compile, I still get the NaN error the first time
"wrapping first mol." appears (turning iwrap off didn't help). Please
let me know whether you think the bugfix patch errors are the problem
and, if so, how to fix them, or whether it could be something else.
Also, just to double-check: you did run this on your machine until
the first time "wrapping first mol." appears, correct? My simulation
has always run fine up to that point.
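
To pin down where the NaN first shows up, the mdout can be scanned
for the first line containing "NaN" together with the last NSTEP
printed before it. A minimal sketch in Python, assuming the mdout
layout of the excerpt quoted below; the name first_nan is
illustrative and not part of AMBER:

```python
import re

# Matches the step counter in an mdout energy record, e.g. "NSTEP =   345000".
NSTEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")


def first_nan(mdout_text):
    """Locate the first line of an mdout that contains NaN.

    Returns (last_clean_nstep, line_number, line), where last_clean_nstep
    is the most recent NSTEP value printed before the NaN line (None if
    there was none) and line_number is 1-based. Returns None when the
    text contains no NaN at all.
    """
    last_nstep = None
    for lineno, line in enumerate(mdout_text.splitlines(), start=1):
        if "NaN" in line:
            return last_nstep, lineno, line.strip()
        match = NSTEP_RE.search(line)
        if match:
            last_nstep = int(match.group(1))
    return None
```

In the excerpt quoted below, the first NaN is on a "check COM
velocity" line, one line before the first "wrapping first mol."
message, and the last clean record is NSTEP = 345000.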

Thanks for your help,

Bill


On Wed, Nov 30, 2011 at 8:01 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
> Hi Bill,
>
> Based on the output files you sent me you have the following header:
>
> |--------------------- INFORMATION ----------------------
> | GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.
> |
> | Implementation by:
> |                    Ross C. Walker     (SDSC)
> |                    Scott Le Grand     (nVIDIA)
> |                    Duncan Poole       (nVIDIA)
> |
> | CAUTION: The CUDA code is currently experimental.
> |          You use it at your own risk. Be sure to
> |          check ALL results carefully.
> |
> | Precision model in use:
> |      [SPDP] - Hybrid Single/Double Precision (Default).
> |
> |--------------------------------------------------------
>
> This suggests you are using some ancient and/or modified version of the
> PMEMD GPU code. This would explain why your test cases are so messed up as
> well and is almost certainly the cause of all your problems. This header
> should read:
>
> |--------------------- INFORMATION ----------------------
> | GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.
> |                      Version 2.2
> |
> |                      08/16/2011
> |
> |
> | Implementation by:
> |                    Ross C. Walker     (SDSC)
> |                    Scott Le Grand     (nVIDIA)
> |                    Duncan Poole       (nVIDIA)
> |
> | CAUTION: The CUDA code is currently experimental.
> |          You use it at your own risk. Be sure to
> |          check ALL results carefully.
> |
> | Precision model in use:
> |      [SPDP] - Hybrid Single/Double Precision (Default).
> |
> |--------------------------------------------------------
>
>
> Note the "Version 2.2" line.
>
> I tried running with a fully patched version of AMBER 11 and I can't
> reproduce your problem. Also, a few comments on your input file:
>
> &cntrl
>
>  timlim = 999999., nmropt = 0,       imin = 0,
>  ntx    = 5,       irest  = 1,       ntrx = 1,      ntxo   = 1,
>  ntpr   = 5000,    ntwx   = 5000,     ntwv = 0,      ntwe   = 0,
>  ioutfm = 1,       ntwr   = 5000,
>
>  ntf    = 2,       ntb    = 1,
>  igb    = 0,
>  cut    = 9,    nsnb   = 20,
>
>  nstlim = 25000000,   nscm   = 2500,   iwrap = 1,
>  t      = 0.0,     dt     = 0.002,
>
>  temp0  = 300.0,   tempi  = 200.0,    tautp=0.5,         ig  = -1,
>  heat   = 0.0,     ntt    = 1,
>
>  ntc    = 2,       tol    = 0.00001, jfastw = 0,
>
>  ibelly=0, ntr=0,
>
> &end
>
>
> timlim - Not sure this even does anything anymore.
> ntwr=5000 - this is probably too frequent for writing a restart - set it
> to around 50000 or more for better performance.
> nsnb = 20 - you should NOT set this anymore (since AMBER 7, in fact). Leave it
> at the default and let pmemd automatically rebuild the nonbonded list when
> needed.
>
> tempi= 200.0 - this is a restart (irest=1) so this is ignored
> tautp=0.5 - this is a very tight coupling for the Berendsen thermostat. I
> prefer something along the lines of 10.0 for production runs.
> ig=-1 - This is good to set, but in your case (irest=1, ntt=1) it has
> absolutely no effect on the results since random numbers are not used here.
> jfastw=0 - Don't mess with this, just remove it from your mdin file.
>
> I suggest completely deleting your amber installation and starting again from
> scratch with:
>
> tar xvjf Amber11.tar.bz2
> tar xvjf AmberTools-1.5.tar.bz2
> cd amber11
> export AMBERHOME=`pwd`
> wget http://ambermd.org/bugfixes/AmberTools/1.5/bugfix.all
> patch -p0 < bugfix.all
> wget http://ambermd.org/bugfixes/11.0/bugfix.all.tar.bz2
> wget http://ambermd.org/bugfixes/11.0/apply_bugfix.x
>
> chmod 700 apply_bugfix.x
> ./apply_bugfix.x $AMBERHOME/bugfix.all.tar.bz2
>
> Then rebuild everything from scratch and you should be good.
>
> All the best
> Ross
>
>> -----Original Message-----
>> From: Ross Walker [mailto:ross.rosswalker.co.uk]
>> Sent: Wednesday, November 30, 2011 5:26 PM
>> To: 'AMBER Mailing List'
>> Subject: Re: [AMBER] CUDA NaN error occuring at "wrapping first mol"
>>
>> Hi Bill,
>>
>> Can you please send me all of the files I need to reproduce this on my own
>> machine, i.e. the prmtop, inpcrd and mdin files?
>>
>> Your test diff file is VERY concerning, though. It looks like the patch
>> didn't apply properly, since you seem to have tons of weird differences
>> in the test cases.
>>
>> All the best
>> Ross
>>
>> > -----Original Message-----
>> > From: Bill Sinko [mailto:wsinko.ucsd.edu]
>> > Sent: Wednesday, November 30, 2011 3:55 PM
>> > To: amber.ambermd.org
>> > Subject: [AMBER] CUDA NaN error occuring at "wrapping first mol"
>> >
>> > I have noticed an error occurring in pmemd.cuda at the first instance
>> > of the words "wrapping first mol." in the mdout file for larger
>> > systems (~60,000 atoms). After this error the restart and coordinate
>> > files are filled with NaN. This same system has been run out to 160 ns
>> > with no error using pmemd.mpi. I have run smaller systems (10,000 to
>> > 15,000 atoms) out to 1 microsecond with the same pmemd.cuda executable
>> > on the same computer with no error. "wrapping first mol." is output
>> > numerous times in the smaller systems but does not cause problems.
>> >
>> > I am running pmemd.cuda from amber11 with the latest bugfixes (up to 19)
>> > and cudatoolkit 4.0.17, on 2 quad-core Intel(R) Xeon(R) CPU X5472 @
>> > 3.00GHz. This error was seen using a GTX570, a GTX580 (3 GB memory),
>> > and a Tesla C2050 (running on a separate computer with 2 quad-core
>> > Intel(R) Xeon(R) CPU W5580 @ 3.20GHz).
>> >
>> > The error is reproducible in that it occurs at the exact same time
>> > given the same ig value, and occurs as soon as "wrapping first
>> > mol." appears given a random ig value. Turning iwrap off does not fix
>> > the problem: it occurs at the same time point at which "wrapping
>> > first mol." appeared with iwrap on.
>> >
>> > Below is the input file, and first instance of error in the output
>> > file.  I also attached the test log file from when I compiled.  Any
>> > help is much appreciated.
>> >
>> >
>> > Thanks,
>> >
>> > Bill
>> >
>> >
>> > &cntrl
>> >
>> >  timlim = 999999., nmropt = 0,       imin = 0,
>> >  ntx    = 5,       irest  = 1,       ntrx = 1,      ntxo   = 1,
>> >  ntpr   = 5000,     ntwx   = 5000,     ntwv = 0,      ntwe   = 0,
>> >  ioutfm = 1,       ntwr   = 5000,
>> >
>> >  ntf    = 2,       ntb    = 1,
>> >  igb    = 0,
>> >  cut    = 9,    nsnb   = 20,
>> >
>> >  nstlim = 25000000,   nscm   = 2500,   iwrap = 1,
>> >  t      = 0.0,     dt     = 0.002,
>> >
>> >  temp0  = 300.0,   tempi  = 200.0,    tautp=0.5,         ig  = -1,
>> >  heat   = 0.0,     ntt    = 1,
>> >
>> >  ntc    = 2,       tol    = 0.00001, jfastw = 0,
>> >
>> >  ibelly=0, ntr=0,
>> >
>> > &end
>> >
>> >
>> > Everything is fine until here, and the words "wrapping first mol." have
>> > not occurred yet. Here is the mdout when the error starts.
>> >
>> >
>> >  NSTEP =   345000   TIME(PS) =    6470.000  TEMP(K) =   300.63  PRESS =     0.0
>> >  Etot   =   -142135.0768  EKtot   =     36637.2070  EPtot      =   -178772.2838
>> >  BOND   =      1471.6496  ANGLE   =      3908.0882  DIHED      =      5198.8718
>> >  1-4 NB =      1769.8981  1-4 EEL =     17988.7091  VDWAALS    =     19870.0613
>> >  EELEC  =   -228979.5620  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>> >  ------------------------------------------------------------------------------
>> >
>> > check COM velocity, temp:        0.000021     0.00(Removed)
>> > check COM velocity, temp:             NaN      NaN(Removed)
>> > wrapping first mol.:            NaN            NaN            NaN
>> > wrapping first mol.:            NaN            NaN            NaN
>> >
>> >  NSTEP =   350000   TIME(PS) =    6480.000  TEMP(K) =      NaN  PRESS =     0.0
>> >  Etot   =            NaN  EKtot   =            NaN  EPtot      =   -124095.8889
>> >  BOND   =         0.0000  ANGLE   =    955000.0368  DIHED      =         0.0000
>> >  1-4 NB =         0.0000  1-4 EEL =         0.0000  VDWAALS    =     -1374.3325
>> >  EELEC  =  -1077721.5932  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>> >
>> >
>> >
>> >
>> > --
>> > William Sinko
>> >
>> > Biomedical Sciences Graduate Student
>> >
>> > Professor J. Andrew McCammon Group
>> >
>> > Howard Hughes Medical Institute
>> >
>> > University of California, San Diego
>> > 9500 Gilman Drive
>> > La Jolla, California 92093
>>
>>
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber



-- 
William Sinko
Biomedical Sciences Graduate Student
Professor J. Andrew McCammon Group
Howard Hughes Medical Institute
University of California, San Diego
9500 Gilman Drive
La Jolla, California 92093


Received on Thu Dec 01 2011 - 16:30:02 PST