Re: [AMBER] NaN error in .rst files

From: Marek Maly <marek.maly.ujep.cz>
Date: Wed, 26 Jan 2011 13:19:45 +0100

Hi Peker and All,

I also recently ran into a NaN problem in one of my simulations using
pmemd.cuda on a GTX 470 (see the relevant data below). Unfortunately it is
not reproducible, as I am using ig=-1 (a different random seed each run).
The crash appeared relatively close to the start of the simulation,
within the first 4 ns.
The first thing I tried was to restart the run from the RST file of the
previous simulation period, and it seems to be OK (I am now another 2 ns
(1e6 steps) past that critical point).


Unfortunately this kind of error is probably not a simple random error due
to heating etc.,
like, for example, the one reported earlier:

--------------------------------
Error: unspecified launch failure launching kernel kClearForces
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
STOP PMEMD Terminated Abnormally!
--------------------------------

because, if I am not wrong, Peker was able to reproduce it.

So although this is not a systematic solution (simulate with ig=-1 and, in
case of a problem, just
restart the simulation from an earlier RST file), it may help.

I have to say that this NaN error seems to be very rare, significantly
rarer than the
random "...kClearForces..." error above, which is itself relatively
infrequent.

Here are some data from the crashed part of the simulation.


#in file
---------------

&cntrl
   imin=0,irest=1,ntx=5,
   nstlim=250000,dt=0.002,
   ntc=2,ntf=2,
   cut=10.0, ntb=2, ntp=1, taup=1.0,
   ntpr=5000, ntwx=5000,
   ntt=3, gamma_ln=2.0, ig=-1,
     temp0=298,
  /

---------------
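
As a quick sanity check on the timing implied by this input, here is a small Python sketch. The variable names simply mirror the namelist keys, and the 3700 ps start time is an assumption inferred from the first energy record in the out file (NSTEP = 5000 at TIME(PS) = 3710.000); it is not stated explicitly anywhere.

```python
# Values copied from the &cntrl namelist above.
nstlim = 250_000  # MD steps in this run
dt_ps = 0.002     # time step, in ps
ntpr = 5_000      # steps between energy records in the out file

run_length_ps = nstlim * dt_ps      # simulated time covered by one run
record_spacing_ps = ntpr * dt_ps    # simulated time between energy records
records_per_run = nstlim // ntpr    # energy records written per run

# Assumption: the run started at 3700 ps, inferred from the first record
# (NSTEP = 5000 at TIME(PS) = 3710.000); it is not stated explicitly.
start_ps = 3710.0 - record_spacing_ps
end_ps = start_ps + run_length_ps   # compare with the time in the RST header

print(f"run length:     {run_length_ps:.0f} ps")
print(f"record spacing: {record_spacing_ps:.0f} ps ({records_per_run} records per run)")
print(f"expected end:   {end_ps:.0f} ps")
```

The expected end time of 4200 ps agrees with the 0.4200000E+04 in the RST header, which suggests the run carried on to its last step after the NaNs appeared.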

#part of out file
-------------------

  NSTEP = 5000 TIME(PS) = 3710.000 TEMP(K) = 298.54 PRESS = 19.7
  Etot = -208138.5945 EKtot = 55606.0703 EPtot = -263744.6648
  BOND = 12770.3337 ANGLE = 5327.0090 DIHED = 1796.1712
  1-4 NB = 1220.6297 1-4 EEL = 32614.9838 VDWAALS = 44968.0029
  EELEC = -362441.7951 EHBOND = 0.0000 RESTRAINT = 0.0000
  EKCMT = 25890.2734 VIRIAL = 25498.7611 VOLUME = 920970.7237
                                                     Density = 1.0185
  ------------------------------------------------------------------------------


  NSTEP = 10000 TIME(PS) = 3720.000 TEMP(K) = 296.22 PRESS = 73.8
  Etot = -208072.6867 EKtot = 55172.7422 EPtot = -263245.4289
  BOND = 12797.2804 ANGLE = 5280.3633 DIHED = 1793.7009
  1-4 NB = 1236.9252 1-4 EEL = 32530.0890 VDWAALS = 45147.6190
  EELEC = -362031.4066 EHBOND = 0.0000 RESTRAINT = 0.0000
  EKCMT = 25827.2976 VIRIAL = 24357.5606 VOLUME = 921836.9062
                                                     Density = 1.0176
  ------------------------------------------------------------------------------


  NSTEP = 15000 TIME(PS) = 3730.000 TEMP(K) = 299.32 PRESS = 133.8
  Etot = -207140.0864 EKtot = 55750.0508 EPtot = -262890.1372
  BOND = 12813.3463 ANGLE = 5316.1231 DIHED = 1777.2270
  1-4 NB = 1214.8564 1-4 EEL = 32504.7227 VDWAALS = 45134.9078
  EELEC = -361651.3206 EHBOND = 0.0000 RESTRAINT = 0.0000
  EKCMT = 26061.3868 VIRIAL = 23397.2429 VOLUME = 922304.1794
                                                     Density = 1.0171
  ------------------------------------------------------------------------------


  NSTEP = 20000 TIME(PS) = 3740.000 TEMP(K) = NaN PRESS = NaN
  Etot = NaN EKtot = NaN EPtot = NaN
  BOND = 0.0000 ANGLE = 795708.3759 DIHED = 0.0000
  1-4 NB = 0.0000 1-4 EEL = 0.0000 VDWAALS = -1408.8764
  EELEC = NaN EHBOND = 0.0000 RESTRAINT = 0.0000
  EKCMT = 0.0000 VIRIAL = NaN VOLUME = NaN
                                                     Density = NaN
  ------------------------------------------------------------------------------
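
To find where a run first went bad without reading the whole out file by eye, something like the following Python sketch could be used. It assumes only the layout visible in the excerpt above (an "NSTEP = <integer>" field opening each energy record, and the literal text "NaN" in bad fields); the function name first_nan_step is made up for illustration and is not part of AMBER.

```python
import re

# Matches the step counter that opens each energy record.
NSTEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")

def first_nan_step(mdout_text):
    """Return the NSTEP of the first energy record containing NaN, or None."""
    current_step = None
    for line in mdout_text.splitlines():
        match = NSTEP_RE.search(line)
        if match:
            current_step = int(match.group(1))
        # Only report NaN once we know which record we are in.
        if current_step is not None and "NaN" in line:
            return current_step
    return None

# Two records abridged from the output above.
sample = """\
 NSTEP = 15000 TIME(PS) = 3730.000 TEMP(K) = 299.32 PRESS = 133.8
 Etot = -207140.0864 EKtot = 55750.0508 EPtot = -262890.1372
 NSTEP = 20000 TIME(PS) = 3740.000 TEMP(K) = NaN PRESS = NaN
 Etot = NaN EKtot = NaN EPtot = NaN
"""
print(first_nan_step(sample))
```

With ntpr=5000 this only narrows the failure down to a window of 5000 steps (here between steps 15000 and 20000); a smaller ntpr would be needed to localise it further.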

#first part of RST file
-------------------------------

92246 0.4200000E+04
          NaN NaN NaN NaN NaN NaN
          NaN NaN NaN NaN NaN NaN
          NaN NaN NaN NaN NaN NaN
          NaN NaN NaN NaN NaN NaN
          NaN NaN NaN NaN NaN NaN
          NaN NaN NaN NaN NaN NaN
          NaN NaN NaN NaN NaN NaN

-------------------------------

#last part of RST file
------------------------------

          NaN NaN NaN NaN NaN NaN
          NaN NaN NaN NaN NaN NaN
          NaN NaN NaN NaN NaN NaN
          NaN NaN NaN NaN NaN NaN
          NaN NaN NaN NaN NaN NaN
          NaN NaN NaN NaN NaN NaN
          NaN NaN NaN 109.4712190 109.4712190 109.4712190

-------------------------------
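
Since the workaround described above relies on restarting from an earlier RST file, it may be worth checking that the chosen file is itself free of NaNs before using it. A minimal Python sketch, assuming an ASCII restart file like the one excerpted here and assuming the first line is a free-form title that should be skipped (restart_has_nan is a hypothetical helper name, not an AMBER tool):

```python
import os
import tempfile

def restart_has_nan(path):
    """Return True if any line after the first (title) line contains NaN."""
    with open(path, "r", errors="replace") as handle:
        next(handle, None)  # assumption: line 1 is a title, so skip it
        for line in handle:
            if "nan" in line.lower():
                return True
    return False

# Small demonstration on a made-up file shaped like the excerpt above.
demo_text = (
    "demo title\n"
    "    3 0.4200000E+04\n"
    "          NaN NaN NaN NaN NaN NaN\n"
    "          NaN NaN NaN 109.4712190 109.4712190 109.4712190\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".rst", delete=False) as tmp:
    tmp.write(demo_text)
print(restart_has_nan(tmp.name))
os.remove(tmp.name)
```

A plain substring test is used on purpose: it does not depend on the exact column widths of the file, only on "NaN" appearing somewhere in the numeric section.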


Best wishes,

      Marek





On Tue, 25 Jan 2011 23:30:53 +0100, peker milas <pekermilas.gmail.com>
wrote:

> Hello again,
>
> I made 3 consecutive runs (500 ps each) with iwrap=1, but unfortunately
> the last one created an .rst file full of NaNs.
> That being said, the only remaining candidate is temperature. I will
> try to fix it according to your examples and advice. I will let you
> know about the results.
>
> all the best
> peker
>
> On Tue, Jan 25, 2011 at 10:09 AM, filip fratev <filipfratev.yahoo.com>
> wrote:
>> Hi Ross,
>> Thank you for the detailed explanation! I was surprised about the
>> factory overclock, but I am not an expert. Do you think that the 3GB
>> version of the GTX580 will produce any problems?
>> I have never overclocked a GPU under Linux. However, I think it is
>> sometimes helpful just for testing purposes, anyway.
>>
>> My personal opinion is that 100% fan speed on a GTX470 produces very
>> high noise, and it would be helpful if someone could share some
>> experience and define the temperature level that is "safe" for CUDA
>> calculations. Probably I did not understand well the point about
>> differences in the GTX470 temperatures. I thought that the problem was
>> just the fan control. To be clear, I will share my data:
>>
>> 100% fan speed: max 65C
>> 80%-70% fan speed: max 73C-80C
>> 40% fan speed: 90C+
>>
>> For me 75%-80% is OK with regard to the noise level.
>>
>> All the best,
>> Filip
>>
>>
>>
>> --- On Tue, 1/25/11, Ross Walker <ross.rosswalker.co.uk> wrote:
>>
>>> From: Ross Walker <ross.rosswalker.co.uk>
>>> Subject: Re: [AMBER] NaN error in .rst files
>>> To: "'AMBER Mailing List'" <amber.ambermd.org>
>>> Date: Tuesday, January 25, 2011, 5:10 PM
>>> Hi Peker and others,
>>>
>>> > your temperature…and this really can be problem. Unfortunately, Nvidia
>>> > continue to irritate the people and don't provide overclock possibility for
>>> > Fermi (even in the new beta drivers 270.13), but fortunately there is a fan
>>> > control option. You just need to write in your Xorg file, just below device
>>>
>>> I think I have said this already but I will say it again.
>>> Please please please (pretty please?) do NOT overclock any
>>> of your graphics cards, GTX or Tesla, if you want to run
>>> AMBER (or any other simulations for that matter). Gaining
>>> yourself a few % speedup is not worth it given you just end
>>> up wasting everybody's time trying to track down bugs that do
>>> not exist. This also applies to buying manufacturer
>>> overclocked versions of GTX cards. The reason you overclock
>>> graphics cards so much is that when running graphics they
>>> are much more fault tolerant than CPUs. For example read and
>>> write errors in the memory only translate to the odd
>>> graphical glitch which likely goes unnoticed. During an MD
>>> calculation, however, this can be disastrous.
>>>
>>> People do not overclock clusters and supercomputers for a
>>> reason and you should NOT overclock your GPU for the same
>>> reason. Certainly as a reviewer of manuscripts I would not
>>> accept one for publication if the calculations had been done
>>> on overclocked GPU hardware since that would throw doubt on
>>> the entire set of simulations.
>>>
>>> > Probably you know that the default GTX470 fan speed is ONLY 40%. I run my,
>>> > during Amber calculations, with fan speed between 75-80% in order to keep
>>> > the temperature up to 70C, but some times it reach the level of 80C. Thus if
>>>
>>> With regards to the fan speed my recommendation is that you
>>> should set ALL fans, both the GPU and case / CPU fans to
>>> 100% (you can often set CPU and case fans to performance
>>> mode in the bios to achieve this) if you are running
>>> calculations on a machine. It will also benefit from an air
>>> conditioned room or office.
>>>
>>> All the best
>>> Ross
>>>
>>> /\
>>> \/
>>> |\oss Walker
>>>
>>> ---------------------------------------------------------
>>> | Assistant Research Professor                          |
>>> | San Diego Supercomputer Center                        |
>>> | Adjunct Assistant Professor                           |
>>> | Dept. of Chemistry and Biochemistry                   |
>>> | University of California San Diego                    |
>>> | NVIDIA Fellow                                         |
>>> | http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
>>> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk  |
>>> ---------------------------------------------------------
>>>
>>> Note: Electronic Mail is not secure, has no guarantee of
>>> delivery, may not be read every day, and should not be used
>>> for urgent or sensitive issues.
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> AMBER mailing list
>>> AMBER.ambermd.org
>>> http://lists.ambermd.org/mailman/listinfo/amber
>>>


-- 
This message was created with Opera's revolutionary e-mail client:
http://www.opera.com/mail/
Received on Wed Jan 26 2011 - 04:30:02 PST