Re: [AMBER] pmemd.cuda error: launch timeout..

From: Sasha Buzko <obuzko.ucla.edu>
Date: Tue, 08 Jun 2010 17:55:23 -0700

Ross, Scott,
I repeated the failed run from its initial restart file, and it worked
fine. Looks like Ross may be correct about the X server, because it is
running on the system with GTX480. I'll keep an eye on our other system
with Tesla cards (C1060), but for now it looks like the problem is
either X server interference or some power issue, and not the code.

Thanks for your help

Sasha
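
The deviceQuery output quoted further down this thread contains the line "Run time limit on kernels: Yes". That line indicates the driver's kernel watchdog is active on that device, which is typically the case when the GPU also drives a display such as an X server. A minimal sketch for checking a saved deviceQuery log for that line follows. The function name and the sample text are illustrative assumptions, not part of AMBER or the CUDA SDK.

```python
import re


def kernel_time_limit_enabled(device_query_text):
    """Return True/False from the 'Run time limit on kernels' line, or None if absent."""
    # re.search scans the whole text, so leading quote markers or indentation do not matter.
    match = re.search(
        r"Run time limit on kernels:\s*(Yes|No)", device_query_text, re.IGNORECASE
    )
    if match is None:
        return None
    return match.group(1).lower() == "yes"


# Hypothetical excerpt in the same shape as the deviceQuery output in this thread.
sample = """\
Device 0: "GeForce GTX 280"
  Concurrent copy and execution: Yes
  Run time limit on kernels: Yes
  Integrated: No
"""

if kernel_time_limit_enabled(sample):
    print("Watchdog active: a long-running kernel on this GPU may be killed with a launch timeout.")
```

If the check reports the watchdog as active on a card used for pmemd.cuda, running the job on a GPU that is not driving a display is the usual way to rule out the X server.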


Ross Walker wrote:
> Hi Scott and Sasha,
>
> Note that somebody else on the list saw something similar with their GTX480. I
> tried it on my C2050 for 3 days and could not reproduce the error. They did
> not explain very well how their machine was actually set up, so I ultimately
> concluded that they were running X Windows on the GTX480 at the same time.
> That made me assume all bets were off, since they could easily just fire up
> something that ate a chunk of GPU memory and killed the PMEMD job.
>
> However, if you can confirm this is NOT the case here then this probably
> needs looking into more carefully.
>
> All the best
> Ross
>
>
>> -----Original Message-----
>> From: amber-bounces.ambermd.org [mailto:amber-bounces.ambermd.org] On
>> Behalf Of Scott Le Grand
>> Sent: Tuesday, June 08, 2010 1:24 PM
>> To: AMBER Mailing List
>> Subject: RE: [AMBER] pmemd.cuda error: launch timeout..
>>
>> Second, when this happens again, try to restart from the last restart file.
>>
>> This is important because if it goes beyond where it ostensibly should
>> crash, that means you probably have a cooling/power problem or a flaky
>> GPU. If not, then it's definitely a bug and please email me the quick
>> and easy repro restart file.
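
The restart test described above can be written down as a small decision rule. This is only a sketch of the reasoning in the message; the function name, its arguments, and the returned strings are hypothetical, and the step numbers would have to be read from the mdout files by hand or by a separate script.

```python
def triage_rerun(crash_step, rerun_last_step, rerun_crashed):
    """Classify a rerun started from the last restart file written before a crash.

    crash_step      -- step at which the original run died
    rerun_last_step -- last step the rerun reached
    rerun_crashed   -- True if the rerun also terminated with an error
    """
    if rerun_last_step > crash_step:
        # The rerun got past the point of the original failure,
        # so the crash is not deterministic for this input.
        return "not reproducible: suspect cooling, power, or a flaky GPU"
    if rerun_crashed:
        # The rerun failed at or before the original crash point
        # when started from the same restart file.
        return "reproducible: likely a bug, send the restart file"
    return "inconclusive: rerun has not yet reached the original crash step"


print(triage_rerun(crash_step=5000, rerun_last_step=20000, rerun_crashed=False))
```
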
>>
>>
>>
>> -----Original Message-----
>> From: amber-bounces.ambermd.org [mailto:amber-bounces.ambermd.org] On
>> Behalf Of Scott Le Grand
>> Sent: Tuesday, June 08, 2010 11:24
>> To: AMBER Mailing List
>> Subject: RE: [AMBER] pmemd.cuda error: launch timeout..
>>
>> Could you try a run with ntpr=1? Let me know if anything bizarre
>> happens right before this...
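
With ntpr=1 the energies are printed every step, so the output can be scanned for a sudden change just before the crash. The sketch below assumes the usual mdout layout, where an "NSTEP = ..." line is followed by a line containing "Etot = ...". The function name, the sample text, and the jump threshold are illustrative assumptions. Note that the averages and fluctuations printed at the end of a completed run use the same labels and would need to be excluded.

```python
import re

STEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")
ETOT_RE = re.compile(r"Etot\s*=\s*(-?\d+(?:\.\d+)?)")


def find_energy_jumps(mdout_text, max_jump):
    """Return (step, previous_etot, etot) for each step whose Etot changed by more than max_jump."""
    jumps = []
    step = None
    previous = None
    for line in mdout_text.splitlines():
        step_match = STEP_RE.search(line)
        if step_match:
            step = int(step_match.group(1))
            continue
        etot_match = ETOT_RE.search(line)
        if etot_match and step is not None:
            etot = float(etot_match.group(1))
            if previous is not None and abs(etot - previous) > max_jump:
                jumps.append((step, previous, etot))
            previous = etot
            step = None  # wait for the next NSTEP record
    return jumps


# Made-up numbers in the assumed mdout shape, for illustration only.
sample = """\
 NSTEP =        1   TIME(PS) =       0.002  TEMP(K) =   300.00  PRESS =     0.0
 Etot   =   -186000.0000  EKtot   =     37000.0000  EPtot      =   -223000.0000
 NSTEP =        2   TIME(PS) =       0.004  TEMP(K) =   300.10  PRESS =     0.0
 Etot   =   -186001.0000  EKtot   =     37001.0000  EPtot      =   -223002.0000
 NSTEP =        3   TIME(PS) =       0.006  TEMP(K) =   450.00  PRESS =     0.0
 Etot   =   -150000.0000  EKtot   =     55000.0000  EPtot      =   -205000.0000
"""

for step, before, after in find_energy_jumps(sample, max_jump=100.0):
    print(f"step {step}: Etot moved from {before} to {after}")
```
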
>>
>>
>> -----Original Message-----
>> From: amber-bounces.ambermd.org [mailto:amber-bounces.ambermd.org] On
>> Behalf Of Sasha Buzko
>> Sent: Tuesday, June 08, 2010 10:55
>> To: AMBER Mailing List
>> Subject: Re: [AMBER] pmemd.cuda error: launch timeout..
>>
>> Yes, it is. I use it now for an extended simulation. The error seems to
>> occur almost randomly, sometimes at the beginning, sometimes after 10
>> ns..
>>
>> Scott Le Grand wrote:
>>
>>> Well that's not good...
>>>
>>> This is the same input file and run you sent me previously?
>>>
>>>
>>> -----Original Message-----
>>> From: amber-bounces.ambermd.org [mailto:amber-bounces.ambermd.org] On
>>> Behalf Of Sasha Buzko
>>> Sent: Tuesday, June 08, 2010 10:18
>>> To: AMBER Mailing List
>>> Subject: Re: [AMBER] pmemd.cuda error: launch timeout..
>>>
>>> Actually, it did happen on C1060 as well. Just the latest error came
>>> when testing on a GTX480..
>>>
>>>
>>> Scott Le Grand wrote:
>>>
>>>
>>>> This is not happening on your C1060 chips, is it?
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: amber-bounces.ambermd.org [mailto:amber-bounces.ambermd.org]
>>>> On Behalf Of Sasha Buzko
>>>> Sent: Tuesday, June 08, 2010 09:54
>>>> To: AMBER Mailing List
>>>> Subject: [AMBER] pmemd.cuda error: launch timeout..
>>>>
>>>> Hi all,
>>>> I'm testing pmemd.cuda on a GTX480 with a moderately sized system in
>>>> explicit solvent (~60k atoms). Every once in a while, a run is
>>>> interrupted by this error message:
>>>> "Error: the launch timed out and was terminated launching kernel
>>>> kPMEGetGridWeights". No other error messages are generated.
>>>>
>>>> The same system and input files are used by the cpu version with no
>>>> issues. The process doesn't seem to be running out of memory, and no
>>>> hardware issue appears to be involved.
>>>> Below is the deviceQuery output.
>>>>
>>>> Thanks for any suggestions
>>>>
>>>> Sasha
>>>>
>>>>
>>>> [sasha.redwood release]$ ./deviceQuery
>>>> ./deviceQuery Starting...
>>>>
>>>> CUDA Device Query (Runtime API) version (CUDART static linking)
>>>>
>>>> There is 1 device supporting CUDA
>>>>
>>>> Device 0: "GeForce GTX 280"
>>>> CUDA Driver Version: 3.0
>>>> CUDA Runtime Version: 3.0
>>>> CUDA Capability Major revision number: 1
>>>> CUDA Capability Minor revision number: 3
>>>> Total amount of global memory: 1073020928 bytes
>>>> Number of multiprocessors: 30
>>>> Number of cores: 240
>>>> Total amount of constant memory: 65536 bytes
>>>> Total amount of shared memory per block: 16384 bytes
>>>> Total number of registers available per block: 16384
>>>> Warp size: 32
>>>> Maximum number of threads per block: 512
>>>> Maximum sizes of each dimension of a block: 512 x 512 x 64
>>>> Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
>>>> Maximum memory pitch: 2147483647 bytes
>>>> Texture alignment: 256 bytes
>>>> Clock rate: 1.30 GHz
>>>> Concurrent copy and execution: Yes
>>>> Run time limit on kernels: Yes
>>>> Integrated: No
>>>> Support host page-locked memory mapping: Yes
>>>> Compute mode: Default (multiple host
>>>> threads can use this device simultaneously)
>>>>
>>>> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4243455, CUDA
>>>> Runtime Version = 3.0, NumDevs = 1, Device = GeForce GTX 280
>>>>
>>>>
>>>> PASSED
>>>>
>>>> Press <Enter> to Quit...
>>>> -----------------------------------------------------------
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> AMBER mailing list
>>>> AMBER.ambermd.org
>>>> http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Jun 08 2010 - 18:00:03 PDT