Hi All,
I can confirm occasional errors on a GTX 470 using Amber 11 with
all bugfixes applied, including bugfix 12.
One of these errors, which seems to appear randomly and is not
reproducible, is this one:
Error: unspecified launch failure launching kernel kClearForces
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
STOP PMEMD Terminated Abnormally!
Errors of this kind might be solved or minimised with Peker's approach
or with improved cooling (liquid cooling ...), I guess.
But I also obtained another, more serious and reproducible error.
As a result of this error, a damaged restart file is created at
a particular point. This is a crucial problem if one is using the
sequential approach which Peker described in his email.
My simulation is divided into parts of 250 000 time steps each.
Once one part is done, the next part starts from the previous part's
restart file.
If the last restart file of the previous part is damaged, the
next part of the simulation does not start.
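
A chained run of this kind could be driven by a small script. The following is only a sketch under assumptions: the topology and input file names (system.prmtop, prod.in) are hypothetical, the output names follow the prodNN_G4malTRI_ANS pattern used in this simulation, and is_restart_ok stands for whatever readability check one trusts. It refuses to start a part whose input restart file fails the check, and it can optionally pause between parts.

```python
import subprocess
import time


def part_command(part, stem="G4malTRI_ANS",
                 prmtop="system.prmtop", mdin="prod.in"):
    """Build the pmemd.cuda command line for one part of the chain.

    prmtop and mdin are hypothetical file names; part N reads the
    restart file written by part N-1.
    """
    return ["pmemd.cuda", "-O",
            "-i", mdin,
            "-p", prmtop,
            "-c", f"prod{part - 1}_{stem}.rst",
            "-o", f"prod{part}_{stem}.out",
            "-r", f"prod{part}_{stem}.rst",
            "-x", f"prod{part}_{stem}.mdcrd"]


def run_chain(first, last, is_restart_ok, cooldown_s=0,
              launch=subprocess.run):
    """Run parts first..last one after another.

    Returns the number of the first part that could not be started
    because its input restart file failed is_restart_ok, or None if
    every part was launched.
    """
    for part in range(first, last + 1):
        cmd = part_command(part)
        prev_rst = cmd[cmd.index("-c") + 1]
        if not is_restart_ok(prev_rst):
            return part
        launch(cmd, check=True)   # raises if the run exits abnormally
        time.sleep(cooldown_s)    # optional pause between parts
    return None
```

Stopping before the launch keeps the last good output files untouched, so the damaged part can be repeated (for example with a different random seed) without losing the rest of the chain.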
In one particular simulation (all the important files are here:
http://physics.ujep.cz/~mmaly/amber/ )
this situation appeared at the end of the 60th part. As you can see in
prod60_G4malTRI_ANS.out, this part, which was started from
prod59_G4malTRI_ANS.rst, was OK, but if you look into
prod61_G4malTRI_ANS.out you can see there
"ERROR:   Could not read coords from prod60_G4malTRI_ANS.rst".
When I tried to open prod60_G4malTRI_ANS.rst in "UCSF Chimera" I got an
error:
"IOError: Failed result (0) reading coords of atom 76353"
Here are the records of atoms 76350, 76351, 76352 and 76353 from the given file:
   -1.6426797  -1.2728143  -1.0625879  -0.4333392  -0.1753845  -0.2861279
   -0.2107036  -0.0146464   0.0331960  -0.9115371   0.3310424  -0.2589435
    0.6583743  -0.3566679  -0.3365556  -0.3229646   0.4308853   0.2198239
    0.0342459   0.7525844   0.0393094  -0.2630745  -0.2090709  -0.4284029
As you can see, there is no evident damage in the record of the reported
atom (the last row).
(VMD is also unable to read this particular RST file.)
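
Since the row shown for atom 76353 looks fine, it may help to scan the whole file for fields that do not parse as numbers. The sketch below is based on an assumption, not on a confirmed diagnosis: it takes the formatted restart file to consist of a title line, an atom-count line, and then values in fixed 12-character columns (the Fortran 6F12.7 layout, as far as I know). The function name is hypothetical. It reports every field that is not a finite number, for example a run of asterisks or a NaN.

```python
import math


def scan_restart_fields(lines, width=12):
    """Return (line_no, field_no, text) for each suspicious field.

    Assumes two header lines followed by fixed-width numeric columns
    of `width` characters each (an assumption about the file layout).
    """
    bad = []
    for line_no, line in enumerate(lines, start=1):
        if line_no <= 2:          # title line and atom-count line
            continue
        row = line.rstrip()       # drop newline and trailing blanks
        for start in range(0, len(row), width):
            field = row[start:start + width]
            field_no = start // width + 1
            try:
                value = float(field)
            except ValueError:    # e.g. '************' or fused numbers
                bad.append((line_no, field_no, field))
                continue
            if math.isnan(value) or math.isinf(value):
                bad.append((line_no, field_no, field))
    return bad
```

It could be applied as scan_restart_fields(open("prod60_G4malTRI_ANS.rst")), and the reported line numbers then point at the exact place where a reader would fail.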
So there are two questions:
#1
What is wrong with the given RST file prod60_G4malTRI_ANS.rst?
#2
Why did this strange damage to the RST file occur?
Two possibly important pieces of information to finish:
A)
When I use ig=-1 at the start of the 60th simulation part (changing the
random seed), the whole part runs well, including the last RST file
(prod60_G4malTRI_ANS.rst), so the 61st simulation part is also OK, and so on.
But if I repeat it without ig=-1, as in my original *.in file, the error
appears in exactly the same way, so it is reproducible.
B)
The whole MDCRD trajectory of the 60th simulation part, including the
last frame, is OK.
Any comments or suggestions are welcome!
Best wishes,
   Marek
On Fri, 21 Jan 2011 05:04:48 +0100, peker milas <pekermilas.gmail.com> wrote:
> Hi all,
>
> As a matter of fact, even with those bug fixes I observed a very
> similar problem. At some point Amber 11 (fresh installation with all
> bug fixes) produced NaNs in the restart file. There is in fact a
> workaround with our GTX 480 card. The method is simply this: divide the
> simulation into shorter segments and run those shorter simulations
> consecutively. Also wait at least 10 minutes for the card to cool
> down to its normal temperature. I know this is very weird, but it
> worked for us. I just wanted to let everyone who has similar
> problems know.
>
> best
> peker milas
>
> On Thu, Jan 20, 2011 at 7:08 PM, Bongkeun Kim <bkim.chem.ucsb.edu> wrote:
>> Hello,
>>
>> I'm compiling Amber 11 with the recent bugfix 12 from clean source.
>> In a day or two I will see whether the error still occurs.
>> By the way, this is the only error from pmemd.cuda and pmemd.cuda.mpi.
>> Thank you.
>> Bongkeun Kim
>>
>> Quoting Jason Swails <jason.swails.gmail.com>:
>>
>>> Hello,
>>>
>>> While Ross knows this code probably much better than I do, I think he
>>> missed something small (but seriously important in this case) regarding
>>> your email.
>>>
>>> The amber11's bugfixes no longer have coincidentally matching bugfixes.
>>> That is to say, the Amber11 bug fixes now go up to 12 (you say you
>>> applied up to 11).
>>>
>>> The 12th bugfix addresses these issues when you use a cutoff value > 8
>>> (which you are; yours is 10).
>>>
>>> Apply bugfix 12 and all should be well.
>>>
>>> Good luck!
>>> Jason
>>>
>>> On Thu, Jan 20, 2011 at 4:14 PM, Bongkeun Kim <bkim.chem.ucsb.edu> wrote:
>>>
>>>> Hello,
>>>>
>>>> I got a NaN error after about 50 ns when I ran pmemd.cuda and
>>>> pmemd.cuda.mpi.
>>>> The log file looks like this:
>>>>
>>>>  NSTEP =  1465000   TIME(PS) =   52980.000  TEMP(K) =   358.79  PRESS =    71.4
>>>>  Etot   =    -62655.3195  EKtot   =     27682.3184  EPtot      =    -90337.6379
>>>>  BOND   =      2126.8615  ANGLE   =      1531.3712  DIHED      =      1681.7735
>>>>  1-4 NB =      8574.2946  1-4 EEL =      1833.2170  VDWAALS    =      8865.3186
>>>>  EELEC  =   -114950.4742  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>>>>  EKCMT  =     12293.6612  VIRIAL  =     11676.7751  VOLUME     =    399930.2222
>>>>                                                     Density    =         0.9998
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>>
>>>>  wrapping first mol.:  -31.3208124120934        0.00000000000000        0.00000000000000
>>>>  wrapping first mol.:  -31.3208124120934        0.00000000000000        0.00000000000000
>>>>
>>>>  NSTEP =  1470000   TIME(PS) =   52990.000  TEMP(K) =   362.41  PRESS =    48.4
>>>>  Etot   =    -62667.6518  EKtot   =     27961.6172  EPtot      =    -90629.2690
>>>>  BOND   =      2136.8358  ANGLE   =      1550.7648  DIHED      =      1682.5454
>>>>  1-4 NB =      8527.4693  1-4 EEL =      1853.5058  VDWAALS    =      8696.1619
>>>>  EELEC  =   -115076.5520  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>>>>  EKCMT  =     12447.5954  VIRIAL  =     12029.4233  VOLUME     =    400265.4168
>>>>                                                     Density    =         0.9990
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>>
>>>>  wrapping first mol.:                     NaN                     NaN                     NaN
>>>>  wrapping first mol.:                     NaN                     NaN                     NaN
>>>>
>>>>  NSTEP =  1475000   TIME(PS) =   53000.000  TEMP(K) =      NaN  PRESS =     NaN
>>>>  Etot   =            NaN  EKtot   =            NaN  EPtot      =            NaN
>>>>  BOND   = **************  ANGLE   =    585786.5880  DIHED      =         0.0000
>>>>  1-4 NB =         0.0000  1-4 EEL =         0.0000  VDWAALS    =      -662.1176
>>>>  EELEC  =            NaN  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>>>>  EKCMT  =         0.0000  VIRIAL  =            NaN  VOLUME     =            NaN
>>>>                                                     Density    =            NaN
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>>
>>>>
>>>>
>>>> It was really strange. I set T=325K and this was well maintained in
>>>> the beginning, but at a certain point the temperature started growing
>>>> and finally I got the NaN error. When I checked the last rst file before
>>>> the NaN error, there were no coordinates or velocities for the water
>>>> molecules and the box size was bigger than the one at the beginning.
>>>> +++++++++++++++++++++++++++++++++++++++
>>>>    0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000
>>>>    0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000
>>>>    0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000
>>>>    0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000
>>>>    0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000
>>>>    0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000
>>>>    0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000
>>>>    0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000
>>>>    0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000
>>>>    0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000
>>>>   31.3730000  80.7640000 158.3730000  90.0000000  90.0000000  90.0000000
>>>> +++++++++++++++++++++++++++++++++++++++++
>>>>
>>>> This is the last part of the rst file from the previous run.
>>>> ++++++++++++++++++++++++++++
>>>>    0.2813319   0.2859586   0.1069026  -0.2630481   0.7645880   0.1471529
>>>>   -0.8100536   1.2586927   0.1523881   0.2990605   0.1620192   0.0976196
>>>>   -0.0732898   1.1917989  -1.0429825   0.2014995   0.3834629  -0.1202106
>>>>    0.0276703  -0.2488241  -0.2628807  -0.2085400   0.4762971   0.4179272
>>>>   -0.3814862  -0.2374063  -0.2416039   0.0699310  -0.0610051  -0.1580978
>>>>    0.9372542   1.0430179  -0.7452719   0.3271696  -0.9559725  -0.3386399
>>>>    0.2260832   0.0151047   0.1283436   1.2348834  -1.0930565   0.2119684
>>>>   -0.7740772   0.0938291   0.2359591   0.2605087   0.0407511  -0.3941893
>>>>    2.2260764  -0.6258161   0.5861404  -0.4234042   0.2330984  -0.6828126
>>>>   85.0975010  80.6688215  55.6648514  90.0000000  90.0000000  90.0000000
>>>> +++++++++++++++++++++++++++++++
>>>>
>>>> My input file is this:
>>>> ++++++++++++++++++++++++
>>>>  &cntrl
>>>>   imin = 0, irest = 1, ntx = 5,
>>>>   ntb = 2, pres0 = 1.0, ntp = 2,
>>>>   taup = 2.0, iwrap=1,
>>>>   cut = 10.0, ntr = 0,
>>>>   ntc = 2, ntf = 2,
>>>>   tempi = 325.0, temp0 = 325.0,
>>>>   ntt = 3, gamma_ln = 1.0,
>>>>   nstlim = 5000000, dt = 0.002,
>>>>   ntpr = 5000, ntwx = 5000, ntwr = 5000
>>>>  /
>>>> +++++++++++++++++++++++++
>>>>
>>>> And I use Amber 11 with bugfix 11.
>>>> Please let me know any ideas that could help me avoid this problem.
>>>> Thank you.
>>>> Bongkeun Kim
>>>> bkim.chem.ucsb.edu
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> AMBER mailing list
>>>> AMBER.ambermd.org
>>>> http://lists.ambermd.org/mailman/listinfo/amber
>>>>
>>>
>>>
>>>
>>> --
>>> Jason M. Swails
>>> Quantum Theory Project,
>>> University of Florida
>>> Ph.D. Graduate Student
>>> 352-392-4032
>>>
>>
>>
>>
>>
>>
>>
>
>
Received on Fri Jan 21 2011 - 19:30:02 PST