I'll be filing a bug today. Also, is this only happening with cards inside an S2050 enclosure? I've never seen a C2050 do this since bugfix 12.
-----Original Message-----
From: Sasha Buzko [mailto:obuzko.ucla.edu]
Sent: Wednesday, February 09, 2011 09:24
To: AMBER Mailing List
Subject: Re: [AMBER] pmemd.cuda issues on Tesla S2050
Hi guys,
The driver issue may indeed be the cause of the disappearing GPUs, as Ross
pointed out (in my case, anyway); the GPUs don't vanish on the 6- and 8-GPU
systems. Unfortunately, running deviceQuery as root doesn't do the trick
here (and yes, I've used it before on other hosts). For some reason the
system still insists that the CUDA runtime and the device driver are
mismatched and can't find any cards. The only thing that kicks any sense
into it is a system reboot.
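(For a quick check of whether the driver can see the cards at all, something
like the rough PyCUDA sketch below could stand in for deviceQuery - it
assumes PyCUDA is installed on the node and only reports the driver version
and enumerates devices:)

# Rough deviceQuery-style probe; assumes PyCUDA is available on the node.
import pycuda.driver as cuda

cuda.init()  # fails outright if the driver is unusable
print("CUDA driver version: %d" % cuda.get_driver_version())
print("Devices found: %d" % cuda.Device.count())
for i in range(cuda.Device.count()):
    dev = cuda.Device(i)
    print("  %d: %s, %d MB" % (i, dev.name(),
                               dev.total_memory() // (1024 * 1024)))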
Scott, what's the procedure for getting Nvidia's attention with regard
to driver support for larger GPU counts? I assume Nvidia would be
interested in this issue, given the potential volume of such multi-GPU
setups once they become reliable.
The Amber11 freezing occurs on Teslas with any number of GPUs - it has
happened on 6- and 8-card systems as well. Also, these lockups don't
appear to be associated with the other problem: when pmemd.cuda locks
up, the system still recognizes all installed GPUs.
Can you think of a good way to deal with it, before we dive into
complicated wrapper scripts with timeouts?
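To be concrete, a wrapper along these lines is what I mean - just a rough
sketch, with the pmemd.cuda command line, file names and timeout as
placeholders: run the job, watch the output file, and kill the process if
the output stops growing.

import os
import subprocess
import time

STALL_TIMEOUT = 1800   # seconds without output growth before giving up
POLL_INTERVAL = 60     # how often to check, in seconds
MDOUT = "mdout"        # placeholder output file name

# Placeholder pmemd.cuda command line; adjust to the actual job.
cmd = ["pmemd.cuda", "-O", "-i", "mdin", "-o", MDOUT,
       "-p", "prmtop", "-c", "inpcrd", "-r", "restrt"]
proc = subprocess.Popen(cmd)

last_size, last_change = -1, time.time()
while proc.poll() is None:              # still running
    time.sleep(POLL_INTERVAL)
    size = os.path.getsize(MDOUT) if os.path.exists(MDOUT) else 0
    if size != last_size:
        last_size, last_change = size, time.time()
    elif time.time() - last_change > STALL_TIMEOUT:
        proc.kill()                     # assume it is hung; let the queue
        break                           # requeue or flag the job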
Thanks
Sasha
Ross Walker wrote:
> Hi Peter
>
> I have seen something similar, but only on boot. Essentially, right after
> booting RHEL5 into init 3, regular users cannot see the CUDA cards. However,
> if root first runs the deviceQuery command from the SDK, then all is well
> and the users themselves can see the cards.
>
> Beyond that, I would check to make sure you do not have any power management
> enabled in the BIOS. Maybe something is causing it to power down the cards.
>
> Next time they vanish, have root run the deviceQuery command and see if
> they reappear.
>
> All the best
> Ross
>
>
>> -----Original Message-----
>> From: Peter Schmidtke [mailto:pschmidtke.mmb.pcb.ub.es]
>> Sent: Tuesday, February 08, 2011 5:22 PM
>> To: amber.ambermd.org
>> Subject: Re: [AMBER] pmemd.cuda issues on Tesla S2050
>>
>> Dear Sasha, Ross & Scott,
>>
>> I'd like to second Sasha's report that the GPUs are occasionally not
>> recognized; I have seen these issues on blades with only 2 Teslas inside.
>> Instead of rebooting, it turns out that simply running Amber's CUDA tests
>> can work miracles.
>>
>> Does anyone have an idea why? I'd also add that if pmemd.cuda does not
>> recognize a Tesla, other CUDA-based tools (PyCUDA, for instance) don't
>> recognize it either.
>>
>> Thanks.
>>
>> Peter
>>
>> On 02/09/2011 01:57 AM, Sasha Buzko wrote:
>>
>>> Hi Ross and Scott,
>>> I've been running some heavy simulations on a cluster with 12 GPUs per
>>> node (a custom build with dual port PCI-E cards on a board with 3 full
>>> speed PCI-E slots + 1 IB). These are array jobs and run independently of
>>> each other. The host systems are dual 6-core Intel chips (1 core per
>>> GPU).
>>>
>>> We've all seen the issue of freezing Amber11 simulations on GTX4*/5*
>>> series cards. It turns out that it happens on Tesla as well. Far more
>>> rarely, but it still does. The only reason I'm noticing it at all is the
>>> sheer number of simulations in my case. Each frozen process continues to
>>> consume 100% of a CPU core but stops producing any output. The
>>> systems are not particularly big (about 20k atoms).
>>> Can you think of any possible explanation for this behavior? Can it be
>>> related to the issues you've been seeing with GTX cards?
>>>
>>> Another strange problem is disappearing GPUs - every once in a while a
>>> node will stop recognizing the GPUs and will only see them after a
>>> reboot. It would happen between consecutive jobs for no apparent
>>> reason.
>>>
>>> It hasn't happened on my other cluster with 6 and 8 GPUs per node. Could
>>> it be that the system is having trouble keeping track of all 12 cards
>>> somehow? Can you think of a remedy for that?
>>>
>>> Once again, this only happens occasionally. The problem is that all
>>> remaining jobs get funneled through such a failed node and are wasted,
>>> since they never get properly processed. This effectively kills the
>>> entire remaining set of jobs.
>>>
>>>
>>> Thanks in advance for any thoughts in this regard.
>>>
>>> Sasha
>>>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Feb 10 2011 - 07:30:02 PST