Re: [AMBER] NaN error on traj and output with AMBER CUDA - strange reproducable error

From: Jason Swails <jason.swails.gmail.com>
Date: Sat, 22 Jan 2011 02:11:57 -0500

Hello,

My comments are below:

2011/1/21 Marek Maly <marek.maly.ujep.cz>

>
> Error: unspecified launch failure launching kernel kClearForces
> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
> STOP PMEMD Terminated Abnormally!
>
> This kind of error might be solved/minimised with Peker's approach
> or with improved cooling (liquid cooling ...), I guess.
>

This would not surprise me. These cards run hot.


>
> But I also obtained another, more serious and reproducible error.
>
> As a result of this error, a damaged restart file is created at
> some point. This is a crucial problem if one is using the sequential
> approach which Peker described in his email.
>
> "IOError: Failed result (0) reading coords of atom 76353"
>
>
> Here are records of atoms (76350,76351,76352,76353) from the given file.
>
>
> -1.6426797 -1.2728143 -1.0625879 -0.4333392 -0.1753845 -0.2861279
> -0.2107036 -0.0146464 0.0331960 -0.9115371 0.3310424 -0.2589435
> 0.6583743 -0.3566679 -0.3365556 -0.3229646 0.4308853 0.2198239
> 0.0342459 0.7525844 0.0393094 -0.2630745 -0.2090709 -0.4284029
>
> As you can see, there is no evident damage in the record of the atom
> reported above (the last row).
>

No, those are not the records for those atoms (at least not in prod60.rst).
Each atom has x, y, z coordinates. The x, y, z coordinates of atom 76353 are
on line 38179 of your restart file: each line holds 6 coordinates, or 2
atoms, so atom 76353 falls on coordinate line 38177, and the first 2 lines
are information lines with no coordinates on them. Line 38179 of your second
restart file is

-95.1521215 126.7658197 331.0708224************ 126.3652303 274.3047611

You can see the obvious file damage there. Restart files use fixed-format
numbers. The restart file format is F12.7, or xxxx.xxxxxxx (12 characters
wide, 7 of them after the decimal point). Therefore, the representable
numbers in this format span from -999.9999999 to 9999.9999999. Given enough
time, a water molecule will diffuse the roughly 1000 angstroms away in the
negative direction and overflow the field allotted to it in the restart
file, which Fortran then fills with asterisks.
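
To make the arithmetic concrete, here is a minimal Python sketch. It assumes
only the layout described above (2 information lines, then 6 F12.7 numbers
per line); the function names are made up for illustration and are not part
of Amber. The asterisk fill is emulated by hand, since Python's own
formatting never truncates a number.

```python
import math

FIELD_WIDTH = 12      # F12.7: 12 characters per number
DECIMALS = 7          # 7 digits after the decimal point
ATOMS_PER_LINE = 2    # 6 coordinates per line = 2 atoms
HEADER_LINES = 2      # information lines with no coordinates


def restart_line_of_atom(atom_index):
    """1-based file line that holds the coordinates of a 1-based atom index."""
    return HEADER_LINES + math.ceil(atom_index / ATOMS_PER_LINE)


def fortran_f12_7(value):
    """Emulate F12.7 output: asterisks when the number does not fit."""
    text = f"{value:.{DECIMALS}f}"
    if len(text) > FIELD_WIDTH:
        return "*" * FIELD_WIDTH
    return text.rjust(FIELD_WIDTH)


print(restart_line_of_atom(76353))   # line number discussed above
print(fortran_f12_7(-999.9999999))   # still fits in 12 characters
print(fortran_f12_7(-1000.0))        # one character too wide: asterisks
print(fortran_f12_7(10000.0))        # same on the positive side
```

The minus sign costs one character, which is why the negative limit is
reached ten times sooner than the positive one.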

This effect has been discussed many times on the list before. Once this
damage occurs, the restart file cannot be used anymore. The solution is to
set the variable iwrap=1 in the &cntrl namelist of your input file. This will
wrap water molecules that leave the primary unit cell back into the unit cell
on the other side.
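
The idea behind wrapping can be sketched in a few lines of Python. This is
only an illustration of periodic imaging along one axis of an orthogonal
box, not Amber's implementation: Amber wraps whole molecules, and the
function name, box length, and coordinate below are illustrative
assumptions.

```python
import math


def wrap_into_cell(coord, box_length):
    """Map a coordinate into [0, box_length) by shifting whole box lengths."""
    return coord - math.floor(coord / box_length) * box_length


box = 85.0975010      # box edge in angstroms (illustrative value)
drifted = -1003.25    # too wide for an F12.7 field
wrapped = wrap_into_cell(drifted, box)

# The wrapped value differs from the original by a whole number of box
# lengths, so it is the same point under periodic boundary conditions.
print(wrapped)
print((wrapped - drifted) / box)
```

Because the shifted coordinate never leaves the box, it always fits in the
restart file's field width for any realistic box size.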


> (VMD is also not able to read this particular RST file)
>
> So there are two problems:
>
> #1
> What is wrong with the given RST file prod60_G4malTRI_ANS.rst ?
>

Described above.


>
> #2
> Why did this strange damage of the RST file occur?
>

Also described above.


>
>
> Maybe two important pieces of information at the end:
>
> A)
> When I use ig=-1 at the start of the 60th simulation part (changing the
> random seed), this whole part goes well, including the last RST file
> (prod60_G4malTRI_ANS.rst), so the 61st simulation part is also OK, and
> so on. But if I repeat it without ig=-1, as it is in my original *.in
> file, the error appears in exactly the same way, so it is reproducible.
>

Setting ig=-1 just changed things around enough so the error didn't occur
immediately. Run a few more simulations and I guarantee you'll see the same
thing.


>
>
> B)
> The whole MDCRD trajectory of the 60th sim. part including the last frame
> is OK.
>

If you're using the ASCII mdcrd file, I find this quite hard to believe. If
you look for them, I'm sure you'll see fields of *******. NetCDF trajectories
do not suffer from this, but based on your input file, there is no ioutfm=1
that I could see, so you're writing ASCII mdcrd files.
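
A quick way to look for such fields is to scan the text file for asterisks.
The following is a minimal Python sketch; the function names are made up
for illustration, and the sample lines are the ones quoted earlier in this
message. It works the same way on an ASCII restart file or an ASCII mdcrd
file.

```python
def find_overflowed_lines(lines):
    """Return (line_number, text) pairs for lines containing '*' fields."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if "*" in line:
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Scan a text restart or mdcrd file on disk."""
    with open(path, "r") as handle:
        return find_overflowed_lines(handle)


sample = [
    "  -1.6426797  -1.2728143  -1.0625879  -0.4333392  -0.1753845  -0.2861279\n",
    " -95.1521215 126.7658197 331.0708224************ 126.3652303 274.3047611\n",
]
for number, text in find_overflowed_lines(sample):
    print(number, text)
```

Note that a title line may legitimately contain an asterisk, so a hit on
the first line of a file is not necessarily damage. Running such a check on
each restart file before starting the next segment of a sequential run
would catch the problem early.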

May I suggest you add ioutfm=1 to your input file and write NetCDF
trajectory files? The advantages of NetCDF are: the files are smaller,
they're much faster to write, read, and process, and they are limited by
machine precision rather than fixed field widths. The disadvantages: you
don't see numbers if you open the file with a text editor, and NetCDF needs
to compile on your system (but this is the default in the amber build, and
I've never seen it fail before).

I hope this helps,
Jason


>
> Any comments/suggestions are gratefully welcomed !
>
> Best wishes,
>
> Marek
>
> On Fri, 21 Jan 2011 05:04:48 +0100, peker milas <pekermilas.gmail.com>
> wrote:
>
> > Hi all,
> >
> > As a matter of fact, even with those bug fixes I observed a very
> > similar problem. At some point amber11 (fresh installation with all
> > bug fixes) produced NaNs in the restart file. There is in fact a
> > workaround with our GTX 480 card. The method is simply this: divide the
> > simulation into shorter segments and run those shorter simulations
> > consecutively. Also wait at least 10 minutes between runs for the card
> > to cool down to its normal temperature. I know this is very weird, but
> > it worked for us. I just wanted to let everyone who has similar
> > problems know.
> >
> > best
> > peker milas
> >
> > On Thu, Jan 20, 2011 at 7:08 PM, Bongkeun Kim <bkim.chem.ucsb.edu> wrote:
> >> Hello,
> >>
> >> I'm compiling amber 11 with the recent bugfix 12 from the clean source.
> >> In a day or two, I will see whether the error occurs or not.
> >> By the way, this is the only error from pmemd.cuda and pmemd.cuda.mpi.
> >> Thank you.
> >> Bongkeun Kim
> >>
> >> Quoting Jason Swails <jason.swails.gmail.com>:
> >>
> >>> Hello,
> >>>
> >>> While Ross knows this code probably much better than I do, I think he
> >>> missed something small (but seriously important in this case)
> >>> regarding your email.
> >>>
> >>> The amber11's bugfixes no longer have coincidentally matching bugfixes.
> >>> That is to say, the Amber11 bug fixes now go up to 12 (you say you
> >>> applied up to 11).
> >>>
> >>> The 12th bugfix addresses these issues when you use a cutoff value > 8
> >>> (which you are; yours is 10).
> >>>
> >>> Apply bugfix 12 and all should be well.
> >>>
> >>> Good luck!
> >>> Jason
> >>>
> >>> On Thu, Jan 20, 2011 at 4:14 PM, Bongkeun Kim <bkim.chem.ucsb.edu> wrote:
> >>>
> >>>> Hello,
> >>>>
> >>>> I got a NaN error when I ran pmemd.cuda and pmemd.cuda.mpi, after
> >>>> about 50 ns.
> >>>> The log file looks like this:
> >>>>
> >>>> NSTEP = 1465000 TIME(PS) = 52980.000 TEMP(K) = 358.79 PRESS = 71.4
> >>>> Etot = -62655.3195 EKtot = 27682.3184 EPtot = -90337.6379
> >>>> BOND = 2126.8615 ANGLE = 1531.3712 DIHED = 1681.7735
> >>>> 1-4 NB = 8574.2946 1-4 EEL = 1833.2170 VDWAALS = 8865.3186
> >>>> EELEC = -114950.4742 EHBOND = 0.0000 RESTRAINT = 0.0000
> >>>> EKCMT = 12293.6612 VIRIAL = 11676.7751 VOLUME = 399930.2222
> >>>> Density = 0.9998
> >>>>
> >>>> ------------------------------------------------------------------------------
> >>>>
> >>>> wrapping first mol.: -31.3208124120934 0.00000000000000 0.00000000000000
> >>>> wrapping first mol.: -31.3208124120934 0.00000000000000 0.00000000000000
> >>>>
> >>>> NSTEP = 1470000 TIME(PS) = 52990.000 TEMP(K) = 362.41 PRESS = 48.4
> >>>> Etot = -62667.6518 EKtot = 27961.6172 EPtot = -90629.2690
> >>>> BOND = 2136.8358 ANGLE = 1550.7648 DIHED = 1682.5454
> >>>> 1-4 NB = 8527.4693 1-4 EEL = 1853.5058 VDWAALS = 8696.1619
> >>>> EELEC = -115076.5520 EHBOND = 0.0000 RESTRAINT = 0.0000
> >>>> EKCMT = 12447.5954 VIRIAL = 12029.4233 VOLUME = 400265.4168
> >>>> Density = 0.9990
> >>>>
> >>>> ------------------------------------------------------------------------------
> >>>>
> >>>> wrapping first mol.: NaN NaN NaN
> >>>> wrapping first mol.: NaN NaN NaN
> >>>>
> >>>> NSTEP = 1475000 TIME(PS) = 53000.000 TEMP(K) = NaN PRESS = NaN
> >>>> Etot = NaN EKtot = NaN EPtot = NaN
> >>>> BOND = ************** ANGLE = 585786.5880 DIHED = 0.0000
> >>>> 1-4 NB = 0.0000 1-4 EEL = 0.0000 VDWAALS = -662.1176
> >>>> EELEC = NaN EHBOND = 0.0000 RESTRAINT = 0.0000
> >>>> EKCMT = 0.0000 VIRIAL = NaN VOLUME = NaN
> >>>> Density = NaN
> >>>>
> >>>> ------------------------------------------------------------------------------
> >>>>
> >>>>
> >>>>
> >>>> It was really strange. I set T=325K and this was well maintained in
> >>>> the beginning, but at a certain point the temperature started rising
> >>>> and finally I got the NaN error. When I checked the last rst file
> >>>> before the NaN error, there were no coordinates and velocities for the
> >>>> water molecules, and the box size was bigger than the one in the
> >>>> beginning.
> >>>> +++++++++++++++++++++++++++++++++++++++
> >>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
> >>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
> >>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
> >>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
> >>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
> >>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
> >>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
> >>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
> >>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
> >>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
> >>>> 31.3730000 80.7640000 158.3730000 90.0000000 90.0000000 90.0000000
> >>>> +++++++++++++++++++++++++++++++++++++++++
> >>>>
> >>>> This is the last part of the rst file from the previous run.
> >>>> ++++++++++++++++++++++++++++
> >>>> 0.2813319 0.2859586 0.1069026 -0.2630481 0.7645880 0.1471529
> >>>> -0.8100536 1.2586927 0.1523881 0.2990605 0.1620192 0.0976196
> >>>> -0.0732898 1.1917989 -1.0429825 0.2014995 0.3834629 -0.1202106
> >>>> 0.0276703 -0.2488241 -0.2628807 -0.2085400 0.4762971 0.4179272
> >>>> -0.3814862 -0.2374063 -0.2416039 0.0699310 -0.0610051 -0.1580978
> >>>> 0.9372542 1.0430179 -0.7452719 0.3271696 -0.9559725 -0.3386399
> >>>> 0.2260832 0.0151047 0.1283436 1.2348834 -1.0930565 0.2119684
> >>>> -0.7740772 0.0938291 0.2359591 0.2605087 0.0407511 -0.3941893
> >>>> 2.2260764 -0.6258161 0.5861404 -0.4234042 0.2330984 -0.6828126
> >>>> 85.0975010 80.6688215 55.6648514 90.0000000 90.0000000 90.0000000
> >>>> +++++++++++++++++++++++++++++++
> >>>>
> >>>> My input file is this:
> >>>> ++++++++++++++++++++++++
> >>>> &cntrl
> >>>> imin = 0, irest = 1, ntx = 5,
> >>>> ntb = 2, pres0 = 1.0, ntp = 2,
> >>>> taup = 2.0, iwrap=1,
> >>>> cut = 10.0, ntr = 0,
> >>>> ntc = 2, ntf = 2,
> >>>> tempi = 325.0, temp0 = 325.0,
> >>>> ntt = 3, gamma_ln = 1.0,
> >>>> nstlim = 5000000, dt = 0.002,
> >>>> ntpr = 5000, ntwx = 5000, ntwr = 5000
> >>>> /
> >>>> +++++++++++++++++++++++++
> >>>>
> >>>> I use amber 11 with bugfix 11.
> >>>> Please let me know of any idea that would help me avoid this problem.
> >>>> Thank you.
> >>>> Bongkeun Kim
> >>>> bkim.chem.ucsb.edu
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> AMBER mailing list
> >>>> AMBER.ambermd.org
> >>>> http://lists.ambermd.org/mailman/listinfo/amber
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Jason M. Swails
> >>> Quantum Theory Project,
> >>> University of Florida
> >>> Ph.D. Graduate Student
> >>> 352-392-4032
> >>>
> >>
> >>
> >>
> >>
> >>
> >>
> >
>
>
>



-- 
Jason M. Swails
Quantum Theory Project,
University of Florida
Ph.D. Graduate Student
352-392-4032
Received on Fri Jan 21 2011 - 23:30:03 PST