Re: [AMBER] cuda lauch time out error in Amber 11 from Scott Le Grand on 2010-08-26 (Amber Archive Aug 2010)

From: Scott Le Grand <SLeGrand.nvidia.com>
Date: Thu, 26 Aug 2010 18:21:29 -0700

PS: If this reproduces on a C1060, it's a bug too. If it doesn't, the architectural differences alone could be concealing it.

-----Original Message-----
From: Sasha Buzko [mailto:obuzko.ucla.edu]
Sent: Thursday, August 26, 2010 18:08
To: AMBER Mailing List
Subject: Re: [AMBER] cuda lauch time out error in Amber 11

Hi Sergio,
the initial post about a C1060 producing this error was incorrect (I
think I mentioned that in a later exchange with Scott).

As far as the error is concerned, I haven't yet seen a system small
enough to avoid it. I get it every few ns even on 20,000 atoms (explicit
solvent). So I don't think it's a lack-of-memory issue.

That said, it would be worthwhile to run a GB simulation of a short
peptide and see whether a tiny system produces this error on a GTX470/480.

Hopefully, Scott can offer a more qualified opinion on this subject.

Sasha

Sergio R Aragon wrote:
> Hi Sasha,
>
> Thanks for your reply. In previous messages you exchanged with Scott Le Grand, you indicated that the error had occurred on a Tesla C1060. Here's the exchange:
> From: Sasha Buzko <obuzko.ucla.edu>
> Date: Tue, 08 Jun 2010 10:17:47 -0700
> Actually, it did happen on a C1060 as well. Just the latest error came
> when testing on a GTX480.
> Scott Le Grand wrote:
>
>> This is not happening on your C1060 chips, is it?
>>
>>
>
> Did you change your mind about that?
>
> By blaming it on the lack of ECC, you apparently think that the error is purely random and unrelated to the size of the system being run. This seems rather inconsistent with being able to run for 6-7 days on a 25,000 atom system without seeing the error. You may be right - perhaps my statistics are not good enough. What's the largest number of atoms you've run on your GTX480 without seeing the error? Or have you seen it regardless of system size if you run for long enough?
>
> Your experience with these systems is a valuable clue for the rest of us.
> Thanks, Sergio
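
Sergio's question about whether his statistics are good enough can be made concrete with a rough calculation. The sketch below is only an illustration under stated assumptions: it treats failures as independent events occurring at a constant rate per simulated nanosecond (a Poisson process), independent of system size, and tries a few assumed mean intervals loosely based on the "every few ns" figure Sasha reports. Those intervals are guesses, and a rate seen on one card may not carry over to another. The 80 ns total comes from the four 20 ns trajectories reported in this thread.

```python
import math

def prob_no_failure(run_length_ns, mean_ns_between_failures):
    """Probability of zero failures in a run, assuming a constant-rate
    (Poisson) failure process. Both arguments are in simulated ns."""
    return math.exp(-run_length_ns / mean_ns_between_failures)

# Four 20 ns trajectories on the 25,000-atom system, as reported in the thread.
clean_ns = 4 * 20

# Assumed mean intervals between failures, for illustration only.
for mean_ns in (3.0, 10.0, 30.0):
    p = prob_no_failure(clean_ns, mean_ns)
    print(f"mean {mean_ns:4.1f} ns between failures: "
          f"P(no failure in {clean_ns} ns) = {p:.2e}")
```

Under these assumptions, a failure rate of one every few ns would make 80 clean nanoseconds very unlikely, which is the inconsistency Sergio points out. It does not by itself say whether system size or the specific card is the cause.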
>
>
> -----Original Message-----
> From: Sasha Buzko [mailto:obuzko.ucla.edu]
> Sent: Thursday, August 26, 2010 5:26 PM
> To: AMBER Mailing List
> Subject: Re: [AMBER] cuda lauch time out error in Amber 11
>
> Sergio,
> I get the same error on a GTX480. It has nothing to do with the X server
> or power issues, as had been suggested before. The deviceQuery failure
> shouldn't be associated with the X server, since I've run it under
> runlevel 3 with no incident. Sometimes you need to run the query as
> root, and then try it again as the regular user, and it works (no idea why).
>
> My guess is that your error message is an inherent problem with a GTX470
> (a consumer card, just like GTX480), since it's never happened to a
> Tesla C1060 (in my experience). I haven't tested the latest Tesla cards
> (C2050 series), but my guess is that random memory errors without ECC
> cause the calculations to bail out, while such events would go unnoticed
> in a gaming/video environment.
> In my case, I'm just using a workaround in the job script to catch the
> error output and rerun the job. And waiting to upgrade to Teslas.
>
> Scott and Ross also might have more informed advice on this one.
>
> Sasha
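
The job-script workaround Sasha mentions (catch the error output and rerun the job) is not shown in the thread. A minimal sketch of the idea, assuming the failure can be recognized by the error text quoted elsewhere in this thread, might look like the following. The function name, the retry limit, and the example command are hypothetical, and choosing which restart file each attempt continues from is left to the caller.

```python
import subprocess
import sys

# Error text quoted in this thread; matching on it is an assumption.
TIMEOUT_MARKER = "the launch timed out and was terminated"

def run_with_retry(cmd, max_attempts=5, marker=TIMEOUT_MARKER):
    """Run cmd, rerunning it while its combined output contains marker.

    Returns (return code of the last attempt, number of attempts made).
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
        )
        if marker not in result.stdout:
            return result.returncode, attempt
        print(f"attempt {attempt}: launch timeout detected, rerunning",
              file=sys.stderr)
    return result.returncode, max_attempts

# Hypothetical usage (file names are placeholders, not from the thread):
# run_with_retry(["pmemd.cuda", "-O", "-i", "md.in", "-o", "md.out"])
```

A production script would probably also continue from the most recent restart file rather than repeat the same segment, since Sergio reports that a restarted run accumulates about another ns before failing again.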
>
>
> Sergio R Aragon wrote:
>
>> Dear Amber Users,
>>
>> I am using a GTX470 card under Linux RH 4.8. I have installed and tested both the CUDA SDK and Amber 11.
>> The deviceQuery program produces the expected output. This card supports CUDA compute capability 2.0, just like the GTX480. There is no Windows OS on the hard drive, only Linux. The host machine is an AMD Phenom II quad-core processor with 8 GB RAM and a 650 W power supply. The machine has only one video card, but all my runs are done via remote access, without logging in to the console through X Windows. If I set the default run level to 3 in the inittab file to prevent X Windows from starting, then the deviceQuery program fails. The driver apparently needs run level 5.
>>
>> I have successfully run about four 20 ns trajectories on a 6 kDa protein with explicit TIP3P water, for a total of 25,000 atoms, water included. However, when I attempt to run a larger protein with a total of 63,000 or 68,000 atoms (water included), I get the "Error: the launch timed out and was terminated launching kernel kPMEGetGridWeights" that has been reported here by Wookyung. This error occurs whether I run an NVT or an NPT ensemble, anywhere from 0.1 ns to 1 ns into the run. I have not seen the error during the preliminary energy minimization steps before MD.
>>
>> If pmemd.cuda were allocating more memory than the card has (1.3 GB for the GTX470, vs about 1.5 GB for the GTX480), then I would expect the run to terminate consistently at the same point every time, but the program terminates at seemingly random times. Restarting the run yields the same behavior - about another ns can be accumulated before the error crops up again.
>>
>> I plan to increase the number of atoms from 25,000 upwards to see at what size the error begins, in order to get an idea of what size problems can be run on this card. However, I would like to understand what is going on. Thanks for any feedback.
>>
>> Sergio Aragon/SFSU
>>
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Aug 26 2010 - 18:30:06 PDT