Re: [AMBER] NaN error on traj and output with AMBER CUDA - strange reproducable error

From: Marek Maly <marek.maly.ujep.cz>
Date: Sat, 22 Jan 2011 14:23:50 +0100

Dear Jason,
first of all, thank you very much for your comments!

I apologize for misinterpreting the RST format. I was assuming one atom
per row, in the format:

xi yi zi vxi vyi vzi

I only quickly verified that there is no NaN in the file and that the
number of rows equals the number of atoms (not counting the first two
records and the last one), and I did not check the format description on
the Amber web page.
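
A more careful check than the quick one described above could look like the
following minimal Python sketch. It assumes the ASCII restart layout from
Jason's reply (two information lines, then six 12-character F12.7 fields per
line, i.e. two atoms per line). The function names find_bad_fields and
restart_line_for_atom are hypothetical helpers written for illustration; they
are not part of Amber.

```python
import math

FIELD_WIDTH = 12      # F12.7: every number occupies exactly 12 characters
FIELDS_PER_LINE = 6   # six numbers per line, i.e. x, y, z of two atoms
HEADER_LINES = 2      # the first two lines carry no coordinates


def restart_line_for_atom(atom_number):
    """Return the 1-based line that holds the coordinates of a 1-based atom."""
    atoms_per_line = FIELDS_PER_LINE // 3
    return HEADER_LINES + math.ceil(atom_number / atoms_per_line)


def find_bad_fields(lines):
    """Return (line_number, field_index, raw_text) for every unreadable field.

    `lines` is any iterable of text lines from an ASCII restart file.
    Line numbers are 1-based; field indices are 0-based within the line.
    """
    bad = []
    for line_number, line in enumerate(lines, start=1):
        if line_number <= HEADER_LINES:
            continue
        text = line.rstrip("\n")
        # Slice by column instead of splitting on whitespace: an overflowed
        # field is written as asterisks and can touch its neighbours.
        for start in range(0, len(text), FIELD_WIDTH):
            raw = text[start:start + FIELD_WIDTH]
            if not raw.strip():
                continue
            try:
                value = float(raw)
            except ValueError:
                bad.append((line_number, start // FIELD_WIDTH, raw))
                continue
            if math.isnan(value) or math.isinf(value):
                bad.append((line_number, start // FIELD_WIDTH, raw))
    return bad
```

For example, passing an open file handle for prod60_G4malTRI_ANS.rst to
find_bad_fields would list every overflowed or NaN field together with its
line number, and restart_line_for_atom(76353) gives the line on which to look
for the atom named in the VMD error message.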


Your explanation seems logical and clear, although two things are still
strange:

a)
You are right, my MDCRD files are in ASCII format but, as I wrote before,
the trajectory of the 60th part of the simulation
(prod60_G4malTRI_ANS.mdcrd) is OK, including the last frame.
At least I was able to load and visualise all 50 frames in UCSF Chimera,
and VMD did not report any errors either. Moreover, I did not find any *
characters in this file.
You can download and check it here: http://physics.ujep.cz/~mmaly/amber/
(prod60_G4malTRI_ANS.mdcrd)


b)
I agree with you that just setting ig=-1 should not solve the problem,
only postpone it.
That was one reason why I posted my question to the Amber list.
Still, it is a little surprising that this change (writing ig=-1 in my
*.in file), which I made at the start of the 60th simulation part of one
of my verification runs, resolved the situation for the subsequent parts
61, 62, 63, 64, 65, 66 and 67 (67 is currently in progress; ig=-1 is set
for part 60 and all subsequent parts), where each part has 250 000 time
steps of 1 fs.
I would have expected a crash during part 61 or 62.


Anyway, thanks for your recommendation regarding iwrap=1. As I am not
interested in diffusion phenomena, it is a good solution. I had never
experienced this type of error, because up to now I was using only CPUs,
where the simulations were a little shorter :)) So this is a brand new
phenomenon for me, which appeared with the long simulation times that
GPUs make achievable in reasonable real time.
Thanks also for recommending the NetCDF MDCRD format. If the standard
visualisation programs I use (UCSF Chimera, VMD) have no problems with
this format, there is no reason for me to use ASCII anymore.
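
The F12.7 limit that Jason describes, and the reason wrapping avoids it, can
be illustrated with a small sketch. This is only an illustration in Python,
not Amber code: the helper names fits_f12_7 and wrap_into_box, the box edge
and the drifted coordinate are all assumptions made for the example. As far
as I understand, iwrap=1 wraps whole molecules, whereas this sketch shifts a
single coordinate along one box edge.

```python
import math

FIELD_WIDTH = 12  # F12.7: 12 characters in total
DECIMALS = 7      # 7 digits after the decimal point


def fits_f12_7(value):
    """True if `value` printed with 7 decimals needs at most 12 characters."""
    return len("%.*f" % (DECIMALS, value)) <= FIELD_WIDTH


def wrap_into_box(coordinate, box_length):
    """Shift a coordinate by whole box lengths into the range starting at 0."""
    return coordinate - box_length * math.floor(coordinate / box_length)


if __name__ == "__main__":
    box = 85.0975010    # hypothetical box edge in angstroms
    drifted = -1000.25  # hypothetical coordinate of a far-diffused water
    print(fits_f12_7(drifted))                      # False
    print(fits_f12_7(wrap_into_box(drifted, box)))  # True
```

The unwrapped value needs 13 characters, so a fixed 12-character field cannot
hold it, while the wrapped value stays well inside the representable range.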

Best wishes,

     Marek

On Sat, 22 Jan 2011 08:11:57 +0100, Jason Swails <jason.swails.gmail.com>
wrote:

> Hello,
>
> My comments are below:
>
> 2011/1/21 Marek Maly <marek.maly.ujep.cz>
>
>>
>> Error: unspecified launch failure launching kernel kClearForces
>> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
>> STOP PMEMD Terminated Abnormally!
>>
>> This kind of error might be solved/minimised with Peker's approach
>> or with improved cooling (liquid cooling ...), I guess.
>>
>
> This would not surprise me. These cards run hot.
>
>
>>
>> But I also obtained another, more serious and reproducible error.
>>
>> As a result of this error, a somehow damaged restart file is created at
>> the given moment. This is a crucial problem if one is using the
>> sequential approach which Peker described in his email.
>>
>> "IOError: Failed result (0) reading coords of atom 76353"
>>
>>
>> Here are records of atoms (76350,76351,76352,76353) from the given file.
>>
>>
>> -1.6426797 -1.2728143 -1.0625879 -0.4333392 -0.1753845 -0.2861279
>> -0.2107036 -0.0146464 0.0331960 -0.9115371 0.3310424 -0.2589435
>> 0.6583743 -0.3566679 -0.3365556 -0.3229646 0.4308853 0.2198239
>> 0.0342459 0.7525844 0.0393094 -0.2630745 -0.2090709 -0.4284029
>>
>> As you can see, there is no evident damage in the record of the atom
>> reported above (the last row).
>>
>
> No they're not (at least not in prod60.rst). Each atom has x, y, z
> coordinates. The x, y, z coordinates of atom 76353 will be on line 38179
> of your restart file (each line has 6 coordinates, or 2 atoms, and the
> first 2 lines are information lines with no coordinates on them). Line
> 38179 of your second restart file is
>
> -95.1521215 126.7658197 331.0708224************ 126.3652303 274.3047611
>
> You can see the obvious file damage there. Restart files operate on
> fixed-format numbers. The restart file is F12.7, or xxxx.xxxxxxx.
> Therefore, the representable numbers in this format span from
> -999.9999999 to 9999.9999999. Given enough time, a water molecule will
> diffuse the 900 angstroms away in the negative direction and overflow the
> field allotted to it in the restart file.
>
> This effect has been discussed many times on the list before. Once this
> damage occurs, it cannot be used anymore. The solution is to set the
> variable iwrap=1 in the &cntrl namelist of your input file. This will
> wrap water molecules that leave the primary unit cell back into the unit
> cell on the other side.
>
>
>> (VMD is also not able to read this particular RST file)
>>
>> So there are two problems:
>>
>> #1
>> What is wrong with the given RST file prod60_G4malTRI_ANS.rst ?
>>
>
> Described above.
>
>
>>
>> #2
>> Why did this strange damage to the RST file occur?
>>
>
> Also described above.
>
>
>>
>>
>> Maybe two important pieces of information at the end:
>>
>> A)
>> When I use ig=-1 at the start of the 60th simulation part (changing the
>> random seed), the whole part runs well, including the last RST file
>> (prod60_G4malTRI_ANS.rst), so the 61st simulation part is also OK, and
>> so on. But if I repeat it without ig=-1, as in my original *.in file,
>> the error appears in exactly the same way, so it is reproducible.
>>
>
> Setting ig=-1 just changed things around enough so the error didn't occur
> immediately. Run a few more simulations and I guarantee you'll see the
> same thing.
>
>
>>
>>
>> B)
>> The whole MDCRD trajectory of the 60th sim. part, including the last
>> frame, is OK.
>>
>
> If you're using the ASCII mdcrd file, I find this quite hard to believe.
> If you look for them, I'm sure you'll see fields of *******. If you're
> using netcdf files, then NetCDF trajectories do not suffer from this.
> Based on your input file, there is no ioutfm=1 that I could see, so
> you're writing ASCII mdcrd files.
>
> May I suggest you add ioutfm=1 to your input file and write NetCDF
> trajectory files. The advantages of NetCDF are: file size is smaller,
> they're much faster to write, they're much faster to read, they're much
> faster to process, they are limited by machine precision rather than
> fixed field widths. The disadvantages: you don't see numbers if you open
> it with a text editor, NetCDF needs to compile on your system (but this
> is default in the amber build, and I've never seen it fail before).
>
> I hope this helps,
> Jason
>
>
>>
>> Any comments/suggestions are gratefully welcomed !
>>
>> Best wishes,
>>
>> Marek
>>
>> On Fri, 21 Jan 2011 05:04:48 +0100, peker milas <pekermilas.gmail.com>
>> wrote:
>>
>> > Hi all,
>> >
>> > As a matter of fact, even with those bug fixes I observed a very
>> > similar problem. At some point amber11 (fresh installation with all
>> > bug fixes) produced NaNs in the restart file. There is in fact a
>> > workaround with our GTX 480 card. The method is simply this: divide
>> > the simulation into shorter segments and run those shorter
>> > simulations consecutively. Also wait at least 10 minutes for the card
>> > to cool down to its normal temperature. I know this is very weird,
>> > but it worked for us. I just wanted to let everyone who has similar
>> > problems know.
>> >
>> > best
>> > peker milas
>> >
>> > On Thu, Jan 20, 2011 at 7:08 PM, Bongkeun Kim <bkim.chem.ucsb.edu> wrote:
>> >> Hello,
>> >>
>> >> I'm compiling amber 11 with the recent bugfix 12 from the clean source.
>> >> Maybe a day or two, I will see the error is occurring or not.
>> >> By the way, this is the only error from pmemd.cuda and pmemd.cuda.mpi.
>> >> Thank you.
>> >> Bongkeun Kim
>> >>
>> >> Quoting Jason Swails <jason.swails.gmail.com>:
>> >>
>> >>> Hello,
>> >>>
>> >>> While Ross knows this code probably much better than I do, I think he
>> >>> missed something small (but seriously important in this case)
>> >>> regarding your email.
>> >>>
>> >>> The amber11's bugfixes no longer have coincidentally matching bugfixes.
>> >>> That is to say, the Amber11 bug fixes now go up to 12 (you say you
>> >>> applied up to 11).
>> >>>
>> >>> The 12th bugfix addresses these issues when you use a cutoff value > 8
>> >>> (which you are; yours is 10).
>> >>>
>> >>> Apply bugfix 12 and all should be well.
>> >>>
>> >>> Good luck!
>> >>> Jason
>> >>>
>> >>> On Thu, Jan 20, 2011 at 4:14 PM, Bongkeun Kim <bkim.chem.ucsb.edu>
>> >>> wrote:
>> >>>
>> >>>> Hello,
>> >>>>
>> >>>> I got NaN error when I ran pmemd.cuda and pmemd.cuda.mpi about after
>> >>>> 50ns.
>> >>>> The log file is like:
>> >>>>
>> >>>> NSTEP = 1465000 TIME(PS) = 52980.000 TEMP(K) = 358.79 PRESS = 71.4
>> >>>> Etot = -62655.3195 EKtot = 27682.3184 EPtot = -90337.6379
>> >>>> BOND = 2126.8615 ANGLE = 1531.3712 DIHED = 1681.7735
>> >>>> 1-4 NB = 8574.2946 1-4 EEL = 1833.2170 VDWAALS = 8865.3186
>> >>>> EELEC = -114950.4742 EHBOND = 0.0000 RESTRAINT = 0.0000
>> >>>> EKCMT = 12293.6612 VIRIAL = 11676.7751 VOLUME = 399930.2222
>> >>>> Density = 0.9998
>> >>>> ------------------------------------------------------------------------------
>> >>>>
>> >>>> wrapping first mol.: -31.3208124120934 0.00000000000000 0.00000000000000
>> >>>> wrapping first mol.: -31.3208124120934 0.00000000000000 0.00000000000000
>> >>>>
>> >>>> NSTEP = 1470000 TIME(PS) = 52990.000 TEMP(K) = 362.41 PRESS = 48.4
>> >>>> Etot = -62667.6518 EKtot = 27961.6172 EPtot = -90629.2690
>> >>>> BOND = 2136.8358 ANGLE = 1550.7648 DIHED = 1682.5454
>> >>>> 1-4 NB = 8527.4693 1-4 EEL = 1853.5058 VDWAALS = 8696.1619
>> >>>> EELEC = -115076.5520 EHBOND = 0.0000 RESTRAINT = 0.0000
>> >>>> EKCMT = 12447.5954 VIRIAL = 12029.4233 VOLUME = 400265.4168
>> >>>> Density = 0.9990
>> >>>> ------------------------------------------------------------------------------
>> >>>>
>> >>>> wrapping first mol.: NaN NaN NaN
>> >>>> wrapping first mol.: NaN NaN NaN
>> >>>>
>> >>>> NSTEP = 1475000 TIME(PS) = 53000.000 TEMP(K) = NaN PRESS = NaN
>> >>>> Etot = NaN EKtot = NaN EPtot = NaN
>> >>>> BOND = ************** ANGLE = 585786.5880 DIHED = 0.0000
>> >>>> 1-4 NB = 0.0000 1-4 EEL = 0.0000 VDWAALS = -662.1176
>> >>>> EELEC = NaN EHBOND = 0.0000 RESTRAINT = 0.0000
>> >>>> EKCMT = 0.0000 VIRIAL = NaN VOLUME = NaN
>> >>>> Density = NaN
>> >>>> ------------------------------------------------------------------------------
>> >>>>
>> >>>>
>> >>>>
>> >>>> It was really strange. I set up T=325K and this was well maintained
>> >>>> in the beginning but at certain point this temperature was growing
>> >>>> up and finally I got NaN error. When I checked the last rst file
>> >>>> before NaN error, there is no coordinates and velocities for water
>> >>>> molecules and the box size is bigger than the one in the beginning.
>> >>>> +++++++++++++++++++++++++++++++++++++++
>> >>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
>> >>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
>> >>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
>> >>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
>> >>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
>> >>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
>> >>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
>> >>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
>> >>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
>> >>>> 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
>> >>>> 31.3730000 80.7640000 158.3730000 90.0000000 90.0000000 90.0000000
>> >>>> +++++++++++++++++++++++++++++++++++++++++
>> >>>>
>> >>>> This is the last part of the rst file from the previous run.
>> >>>> ++++++++++++++++++++++++++++
>> >>>> 0.2813319 0.2859586 0.1069026 -0.2630481 0.7645880 0.1471529
>> >>>> -0.8100536 1.2586927 0.1523881 0.2990605 0.1620192 0.0976196
>> >>>> -0.0732898 1.1917989 -1.0429825 0.2014995 0.3834629 -0.1202106
>> >>>> 0.0276703 -0.2488241 -0.2628807 -0.2085400 0.4762971 0.4179272
>> >>>> -0.3814862 -0.2374063 -0.2416039 0.0699310 -0.0610051 -0.1580978
>> >>>> 0.9372542 1.0430179 -0.7452719 0.3271696 -0.9559725 -0.3386399
>> >>>> 0.2260832 0.0151047 0.1283436 1.2348834 -1.0930565 0.2119684
>> >>>> -0.7740772 0.0938291 0.2359591 0.2605087 0.0407511 -0.3941893
>> >>>> 2.2260764 -0.6258161 0.5861404 -0.4234042 0.2330984 -0.6828126
>> >>>> 85.0975010 80.6688215 55.6648514 90.0000000 90.0000000 90.0000000
>> >>>> +++++++++++++++++++++++++++++++
>> >>>>
>> >>>> My input file is this:
>> >>>> ++++++++++++++++++++++++
>> >>>> &cntrl
>> >>>> imin = 0, irest = 1, ntx = 5,
>> >>>> ntb = 2, pres0 = 1.0, ntp = 2,
>> >>>> taup = 2.0, iwrap=1,
>> >>>> cut = 10.0, ntr = 0,
>> >>>> ntc = 2, ntf = 2,
>> >>>> tempi = 325.0, temp0 = 325.0,
>> >>>> ntt = 3, gamma_ln = 1.0,
>> >>>> nstlim = 5000000, dt = 0.002,
>> >>>> ntpr = 5000, ntwx = 5000, ntwr = 5000
>> >>>> /
>> >>>> +++++++++++++++++++++++++
>> >>>>
>> >>>> And I use amber 11 with bugfix 11.
>> >>>> Please let me know any idea that helps me to avoid this problem.
>> >>>> Thank you.
>> >>>> Bongkeun Kim
>> >>>> bkim.chem.ucsb.edu
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> _______________________________________________
>> >>>> AMBER mailing list
>> >>>> AMBER.ambermd.org
>> >>>> http://lists.ambermd.org/mailman/listinfo/amber
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Jason M. Swails
>> >>> Quantum Theory Project,
>> >>> University of Florida
>> >>> Ph.D. Graduate Student
>> >>> 352-392-4032


Received on Sat Jan 22 2011 - 06:00:05 PST