Re: [AMBER] NaN error on traj and output with AMBER CUDA - strange reproducable error

From: Marek Maly <marek.maly.ujep.cz>
Date: Sat, 22 Jan 2011 04:05:40 +0100

Hi All,

I can confirm occasional errors on a GTX 470 using Amber 11 with
all bugfixes, including bugfix 12, applied.

One of these errors, which seems to appear randomly and is not
reproducible, is this one:


Error: unspecified launch failure launching kernel kClearForces
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
STOP PMEMD Terminated Abnormally!

I guess this kind of error might be solved or minimised with Peker's
approach or with improved cooling (liquid cooling, ...).

But I also obtained another, more serious and reproducible error.

As a result of this error, a damaged restart file is created at some
point. This is a crucial problem if one is using the sequential
approach which Peker described in his email.

My simulation is divided into parts of 250 000 time steps each.
Once one part is done, the next part starts from the restart file
written by the previous one.

If the last restart file of the previous part is damaged, the
next part of the simulation does not start.
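
The sequential scheme can be sketched as a small driver. This is only an
illustrative sketch under assumptions, not the script actually used here:
the prod<N>_<prefix> naming, the cooldown_s pause (the waiting time Peker
suggested) and the restart_ok helper are hypothetical. Only the pmemd.cuda
options -O -i -p -c -o -r -x are the standard ones. The check is
deliberately crude: it only looks for NaN or asterisks, so a file can pass
it and still be unreadable.

```python
import subprocess
import time
from pathlib import Path


def pmemd_command(part, prefix, prmtop, mdin):
    """Build the pmemd.cuda command line for one simulation part.

    Part N starts from the restart file written by part N-1.
    The naming scheme prod<N>_<prefix>.* is assumed for illustration.
    """
    prev_rst = f"prod{part - 1}_{prefix}.rst"
    base = f"prod{part}_{prefix}"
    return ["pmemd.cuda", "-O", "-i", mdin, "-p", prmtop, "-c", prev_rst,
            "-o", f"{base}.out", "-r", f"{base}.rst", "-x", f"{base}.mdcrd"]


def restart_ok(path):
    """Crude sanity check of an ASCII restart file (hypothetical helper).

    Returns False if the file is missing, has no data lines, or contains
    NaN or '*' (Fortran prints asterisks when a number overflows its field).
    """
    p = Path(path)
    if not p.is_file():
        return False
    lines = p.read_text().splitlines()
    if len(lines) < 3:
        return False
    body = lines[2:]  # skip the title line and the atom-count line
    return not any("nan" in line.lower() or "*" in line for line in body)


def run_chain(first, last, prefix, prmtop, mdin, cooldown_s=600,
              run=subprocess.run):
    """Run parts first..last in order; return the part that could not start.

    Returns None if every part was launched.
    """
    for part in range(first, last + 1):
        prev_rst = f"prod{part - 1}_{prefix}.rst"
        if not restart_ok(prev_rst):
            return part  # the input restart for this part looks damaged
        run(pmemd_command(part, prefix, prmtop, mdin), check=True)
        time.sleep(cooldown_s)  # let the card cool down between parts
    return None
```

Stopping the chain at the first suspicious restart file at least keeps the
later parts from failing one after another, and the part number returned
tells which run has to be repeated.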

In one particular simulation (all the important files are at
http://physics.ujep.cz/~mmaly/amber/ )
this situation appeared at the end of the 60th part. As you can see in
prod60_G4malTRI_ANS.out, this part, which was started from
prod59_G4malTRI_ANS.rst, was OK, but if you look into
prod61_G4malTRI_ANS.out you can see
"ERROR: Could not read coords from prod60_G4malTRI_ANS.rst".

When I tried to open prod60_G4malTRI_ANS.rst in UCSF Chimera, I got this
error:


"IOError: Failed result (0) reading coords of atom 76353"


Here are records of atoms (76350,76351,76352,76353) from the given file.


   -1.6426797 -1.2728143 -1.0625879 -0.4333392 -0.1753845 -0.2861279
   -0.2107036 -0.0146464 0.0331960 -0.9115371 0.3310424 -0.2589435
    0.6583743 -0.3566679 -0.3365556 -0.3229646 0.4308853 0.2198239
    0.0342459 0.7525844 0.0393094 -0.2630745 -0.2090709 -0.4284029

As you can see, there is no evident damage in the record of the
reported atom (the last row).

(VMD is not able to read this particular RST file either.)
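
Since looking at a few rows by eye did not reveal anything, a scan of the
whole file may help. Below is a minimal sketch, assuming the usual ASCII
restart layout (a title line, a line beginning with the atom count, then
numbers in fixed 12-character fields, six per line). The function names
check_restart and locate_atom are made up for this illustration. The scan
reports every field that is not a finite number, for example NaN or a run
of asterisks, and locate_atom tells on which line the coordinates of a
given atom should start.

```python
import math


def check_restart(text):
    """Scan the text of an ASCII restart file for suspicious fields.

    Assumed layout: title line, a line starting with the atom count,
    then fixed 12-character numeric fields, six per line (6F12.7).
    Returns (natom, n_values, problems); problems is a list of
    (line_number, field_number, raw_field) for fields that are not
    finite numbers. Line and field numbers are 1-based.
    """
    lines = text.splitlines()
    natom = int(lines[1].split()[0])
    n_values = 0
    problems = []
    for lineno, line in enumerate(lines[2:], start=3):
        for k in range(0, len(line), 12):
            field = line[k:k + 12]
            if not field.strip():
                continue  # padding at the end of a line
            try:
                ok = math.isfinite(float(field))
            except ValueError:
                ok = False  # e.g. '************' from a format overflow
            if ok:
                n_values += 1
            else:
                problems.append((lineno, k // 12 + 1, field))
    return natom, n_values, problems


def locate_atom(atom_index):
    """Return (line_number, field_number) of the x coordinate of an atom.

    atom_index is 1-based; coordinates start on line 3, six values per line.
    """
    first_value = 3 * (atom_index - 1)
    return 3 + first_value // 6, first_value % 6 + 1
```

For a file with coordinates and velocities one would expect n_values to be
6 * natom plus the box values; a smaller count, or a non-empty problems
list, points to where the file is damaged.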

So there are two questions:

#1
What is wrong with the RST file prod60_G4malTRI_ANS.rst?

#2
Why did this strange damage to the RST file occur?


Finally, two pieces of information that may be important:

A)
When I use ig=-1 at the start of the 60th simulation part (changing the
random seed), the whole part runs fine, including the last RST file
(prod60_G4malTRI_ANS.rst), so the 61st part is also OK, and so on. But if
I repeat the run without ig=-1, as in my original *.in file, the error
appears in exactly the same way, so it is reproducible.


B)
The whole MDCRD trajectory of the 60th part, including the last frame,
is OK.


Any comments or suggestions are very welcome!

Best wishes,

   Marek

On Fri, 21 Jan 2011 05:04:48 +0100, peker milas <pekermilas.gmail.com>
wrote:

> Hi all,
>
> As a matter of fact, even with those bug fixes I observed a very
> similar problem. At some point amber11 (fresh installation with all
> bug fixes) produced NaNs in the restart file. There is in fact a
> workaround with our GTX 480 card. The method is simply this: divide the
> simulation into shorter runs and run those shorter simulations
> consecutively. Also wait at least 10 minutes for the card to cool
> down to its normal temperature. I know this is very weird, but it
> worked for us. I just wanted to let everyone who has similar
> problems know.
>
> best
> peker milas
>
> On Thu, Jan 20, 2011 at 7:08 PM, Bongkeun Kim <bkim.chem.ucsb.edu> wrote:
>> Hello,
>>
>> I'm compiling amber 11 with the recent bugfix 12 from the clean source.
>> In a day or two I will see whether the error still occurs.
>> By the way, this is the only error from pmemd.cuda and pmemd.cuda.mpi.
>> Thank you.
>> Bongkeun Kim
>>
>> Quoting Jason Swails <jason.swails.gmail.com>:
>>
>>> Hello,
>>>
>>> While Ross knows this code probably much better than I do, I think he
>>> missed
>>> something small (but seriously important in this case) regarding your
>>> email.
>>>
>>> The amber11's bugfixes no longer have coincidentally matching bugfixes.
>>> That is to say, the Amber11 bug fixes now go up to 12 (you say you
>>> applied
>>> up to 11).
>>>
>>> The 12th bugfix addresses these issues when you use a cutoff value > 8
>>> (which you are; yours is 10).
>>>
>>> Apply bugfix 12 and all should be well.
>>>
>>> Good luck!
>>> Jason
>>>
>>> On Thu, Jan 20, 2011 at 4:14 PM, Bongkeun Kim <bkim.chem.ucsb.edu>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I got a NaN error when I ran pmemd.cuda and pmemd.cuda.mpi after about
>>>> 50 ns.
>>>> The log file looks like this:
>>>>
>>>> NSTEP = 1465000 TIME(PS) = 52980.000 TEMP(K) = 358.79 PRESS = 71.4
>>>> Etot = -62655.3195 EKtot = 27682.3184 EPtot = -90337.6379
>>>> BOND = 2126.8615 ANGLE = 1531.3712 DIHED = 1681.7735
>>>> 1-4 NB = 8574.2946 1-4 EEL = 1833.2170 VDWAALS = 8865.3186
>>>> EELEC = -114950.4742 EHBOND = 0.0000 RESTRAINT = 0.0000
>>>> EKCMT = 12293.6612 VIRIAL = 11676.7751 VOLUME = 399930.2222
>>>> Density = 0.9998
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>>
>>>> wrapping first mol.: -31.3208124120934 0.00000000000000 0.00000000000000
>>>> wrapping first mol.: -31.3208124120934 0.00000000000000 0.00000000000000
>>>>
>>>> NSTEP = 1470000 TIME(PS) = 52990.000 TEMP(K) = 362.41 PRESS = 48.4
>>>> Etot = -62667.6518 EKtot = 27961.6172 EPtot = -90629.2690
>>>> BOND = 2136.8358 ANGLE = 1550.7648 DIHED = 1682.5454
>>>> 1-4 NB = 8527.4693 1-4 EEL = 1853.5058 VDWAALS = 8696.1619
>>>> EELEC = -115076.5520 EHBOND = 0.0000 RESTRAINT = 0.0000
>>>> EKCMT = 12447.5954 VIRIAL = 12029.4233 VOLUME = 400265.4168
>>>> Density = 0.9990
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>>
>>>> wrapping first mol.: NaN NaN NaN
>>>> wrapping first mol.: NaN NaN NaN
>>>>
>>>> NSTEP = 1475000 TIME(PS) = 53000.000 TEMP(K) = NaN PRESS = NaN
>>>> Etot = NaN EKtot = NaN EPtot = NaN
>>>> BOND = ************** ANGLE = 585786.5880 DIHED = 0.0000
>>>> 1-4 NB = 0.0000 1-4 EEL = 0.0000 VDWAALS = -662.1176
>>>> EELEC = NaN EHBOND = 0.0000 RESTRAINT = 0.0000
>>>> EKCMT = 0.0000 VIRIAL = NaN VOLUME = NaN
>>>> Density = NaN
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>>
>>>>
>>>>
>>>> It was really strange. I set T=325K and this was well maintained in
>>>> the beginning, but at a certain point the temperature started rising
>>>> and finally I got the NaN error. When I checked the last rst file
>>>> before the NaN error, there were no coordinates and velocities for the
>>>> water molecules and the box size was bigger than at the beginning.
>>>> +++++++++++++++++++++++++++++++++++++++
>>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
>>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
>>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
>>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
>>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
>>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
>>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
>>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
>>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
>>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
>>>> 31.3730000 80.7640000 158.3730000 90.0000000 90.0000000 90.0000000
>>>> +++++++++++++++++++++++++++++++++++++++++
>>>>
>>>> This is the last part of the rst file from the previous run.
>>>> ++++++++++++++++++++++++++++
>>>> 0.2813319 0.2859586 0.1069026 -0.2630481 0.7645880 0.1471529
>>>> -0.8100536 1.2586927 0.1523881 0.2990605 0.1620192 0.0976196
>>>> -0.0732898 1.1917989 -1.0429825 0.2014995 0.3834629 -0.1202106
>>>> 0.0276703 -0.2488241 -0.2628807 -0.2085400 0.4762971 0.4179272
>>>> -0.3814862 -0.2374063 -0.2416039 0.0699310 -0.0610051 -0.1580978
>>>> 0.9372542 1.0430179 -0.7452719 0.3271696 -0.9559725 -0.3386399
>>>> 0.2260832 0.0151047 0.1283436 1.2348834 -1.0930565 0.2119684
>>>> -0.7740772 0.0938291 0.2359591 0.2605087 0.0407511 -0.3941893
>>>> 2.2260764 -0.6258161 0.5861404 -0.4234042 0.2330984 -0.6828126
>>>> 85.0975010 80.6688215 55.6648514 90.0000000 90.0000000 90.0000000
>>>> +++++++++++++++++++++++++++++++
>>>>
>>>> My input file is this:
>>>> ++++++++++++++++++++++++
>>>> &cntrl
>>>> imin = 0, irest = 1, ntx = 5,
>>>> ntb = 2, pres0 = 1.0, ntp = 2,
>>>> taup = 2.0, iwrap=1,
>>>> cut = 10.0, ntr = 0,
>>>> ntc = 2, ntf = 2,
>>>> tempi = 325.0, temp0 = 325.0,
>>>> ntt = 3, gamma_ln = 1.0,
>>>> nstlim = 5000000, dt = 0.002,
>>>> ntpr = 5000, ntwx = 5000, ntwr = 5000
>>>> /
>>>> +++++++++++++++++++++++++
>>>>
>>>> And I use amber 11 with bugfix 11.
>>>> Please let me know any idea that helps me to avoid this problem.
>>>> Thank you.
>>>> Bongkeun Kim
>>>> bkim.chem.ucsb.edu
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> AMBER mailing list
>>>> AMBER.ambermd.org
>>>> http://lists.ambermd.org/mailman/listinfo/amber
>>>>
>>>
>>>
>>>
>>> --
>>> Jason M. Swails
>>> Quantum Theory Project,
>>> University of Florida
>>> Ph.D. Graduate Student
>>> 352-392-4032
>>>
>>
>


Received on Fri Jan 21 2011 - 19:30:02 PST