Re: [AMBER] CUDA launch timeout error in Amber 11

From: Sasha Buzko <obuzko.ucla.edu>
Date: Thu, 26 Aug 2010 17:26:15 -0700

Sergio,
I get the same error on a GTX480. It has nothing to do with the X server
or power issues, as had been suggested before. The deviceQuery failure
shouldn't be tied to the X server either, since I've run it at
runlevel 3 without incident. Sometimes you need to run the query once as
root and then try it again as the regular user, and it works (no idea why).
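
One guess at the "why", though I haven't verified it: when no X server is
running, the /dev/nvidia* device nodes may not exist yet, and only a root
process has permission to create them, so the first root run creates the
nodes and later runs as a regular user then work. A common workaround is to
create the nodes at boot. A minimal Python sketch of that idea, assuming
the standard NVIDIA major number 195 and a single GPU (run it as root,
e.g. from rc.local):

#!/usr/bin/env python
# Sketch (an assumption, not verified on your box): create the /dev/nvidia*
# character device nodes at boot so CUDA programs work at runlevel 3 with
# no X server and no prior run as root. 195 is the NVIDIA character-device
# major number; nvidiactl uses minor 255. Must be run as root.
import os
import stat

NVIDIA_MAJOR = 195
NUM_GPUS = 1  # assumption: a single GTX470/480 in the machine

def make_node(path, minor):
    # os.mknod needs the char-device flag OR'd into the permission bits
    if not os.path.exists(path):
        os.mknod(path, 0o666 | stat.S_IFCHR,
                 os.makedev(NVIDIA_MAJOR, minor))

for i in range(NUM_GPUS):
    make_node("/dev/nvidia%d" % i, i)
make_node("/dev/nvidiactl", 255)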

My guess is that the error reflects a problem inherent to the GTX470
(a consumer card, just like the GTX480), since it has never happened on a
Tesla C1060 in my experience. I haven't tested the latest Tesla cards
(the C2050 series), but my guess is that random memory errors, with no ECC
to catch them, cause the calculations to bail out, while such events would
go unnoticed in a gaming/video environment.
In my case, I'm just using a workaround in the job script to catch the
error output and rerun the job (a sketch of the idea is below), and
waiting to upgrade to Teslas.
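
The workaround itself is nothing sophisticated; a stripped-down Python
sketch of the idea follows. The pmemd.cuda command line and the filenames
are placeholders, and it assumes the timeout message shows up on
stdout/stderr (if it only lands in the mdout file, the check would have to
read that file instead):

#!/usr/bin/env python
# Rough sketch of the catch-and-rerun workaround described above. The
# md.in/prmtop/md.rst/md.out/md.nc names are placeholders; adapt them to
# your own job script.
import subprocess

MAX_RETRIES = 10
CMD = ["pmemd.cuda", "-O",
       "-i", "md.in", "-p", "prmtop",
       "-c", "md.rst",   # restart file doubles as input coordinates,
       "-r", "md.rst",   # so a rerun picks up from the last restart written
       "-o", "md.out", "-x", "md.nc"]

for attempt in range(1, MAX_RETRIES + 1):
    proc = subprocess.Popen(CMD, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    output = proc.communicate()[0]
    if b"the launch timed out and was terminated" in output:
        print("Launch timeout on attempt %d; rerunning from md.rst" % attempt)
        continue
    if proc.returncode != 0:
        raise SystemExit(output)   # some other failure: stop and show the log
    break                          # clean finish

In a real job script you'd also want to keep a backup copy of the restart
file; -c and -r point at the same md.rst here purely to keep the rerun
logic simple.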

Scott and Ross also might have more informed advice on this one.

Sasha


Sergio R Aragon wrote:
> Dear Amber Users,
>
> I am using a GTX470 card under Linux RH 4.8. I have installed and tested both the CUDA SDK and Amber 11.
> The deviceQuery program produces the expected output. This card supports CUDA compute capability 2.0, just like the GTX480. There is no Windows OS on the hard drive, only Linux. The host machine is an AMD Phenom II quad-core processor with 8 GB of RAM and a 650 W power supply. The machine has only one video card, but all my runs are done via remote access, without logging in to the console under X. If I set the machine to default runlevel 3 in the inittab file to prevent X from starting, the deviceQuery program fails; the driver apparently needs runlevel 5.
>
> I have successfully run about four 20 ns trajectories on a 6 kDa protein with explicit TIP3P water, for a total of 25,000 atoms including water. However, when I attempt to run a larger protein with a total of 63,000 or 68,000 atoms (water included), I get the "Error: the launch timed out and was terminated launching kernel kPMEGetGridWeights" that has been reported here by Wookyung. The error occurs whether I run an NVT or an NPT ensemble, anywhere from 0.1 ns to 1 ns into the run. I have not seen it during the preliminary energy minimization steps before MD.
>
> If pmemd.cuda were allocating more memory than the card has (1.3 GB for the GTX470 vs. about 1.5 GB for the GTX480), I would expect the run to terminate consistently at the same point every time, but the program terminates at seemingly random times. Restarting the run yields the same behavior: about another ns can be accumulated before the error crops up again.
>
> I plan to increase the number of atoms upwards from 25,000 to see at what size the error begins, to get an idea of what size of problem can be run on this card. However, I would like to understand what is going on. Thanks for any feedback.
>
> Sergio Aragon/SFSU
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Aug 26 2010 - 18:00:03 PDT