Re: [AMBER] pmemd.cuda issues on Tesla S2050

From: Sasha Buzko <obuzko.ucla.edu>
Date: Wed, 02 Mar 2011 09:54:08 -0800

Hi guys,
in the last couple of weeks I haven't seen any jobs freeze, so the
latest patch may have fixed that particular problem.

However, I'm still getting cases of hosts dropping all GPUs in the
middle of a job - apparently, the driver issue that Scott had pointed
to. Is there an update on the horizon that would cover the 12 GPU/host
cases?
I'd appreciate any news.

Thanks

Sasha



Scott Le Grand wrote:
> I'll be filing a bug today. Also, is this only happening with cards inside an S2050 enclosure? I've never seen a C2050 do this since bugfix 12.
>
>
>
> -----Original Message-----
> From: Sasha Buzko [mailto:obuzko.ucla.edu]
> Sent: Wednesday, February 09, 2011 09:24
> To: AMBER Mailing List
> Subject: Re: [AMBER] pmemd.cuda issues on Tesla S2050
>
> Hi guys,
> The driver issue may indeed be the cause of the disappearing GPUs, as
> Ross pointed out - they don't vanish on the 6- and 8-GPU systems.
> Unfortunately, running deviceQuery as root doesn't do the trick in my
> case (and yes, I've used it before on other hosts). For some reason it
> still insists that the CUDA runtime and the device driver are mismatched
> and can't find any cards. The only thing that brings the cards back is a
> system reboot.
>
> Scott, what's the procedure for getting Nvidia's attention with regard
> to drivers possibly supporting more GPUs? I assume Nvidia would be
> interested in this issue, given the potential volume of such multi-GPU
> setups once they become reliable...
>
> The Amber11 freezing occurs on Teslas with any number of GPUs - it's
> happened on 6- and 8-card systems as well. Also, these lockups don't
> appear to be associated with the other problem. When pmemd.cuda locks
> up, the system still recognizes all installed GPUs.
> Can you think of a good way to deal with it before we dive into
> complicated wrapper scripts with timeouts?
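>
> For reference, the kind of wrapper I have in mind is sketched below - just
> a rough outline, with the output file name, timeout, and pmemd.cuda command
> line as placeholders rather than our actual setup:
>
> # Watchdog sketch: start pmemd.cuda and kill it if the mdout file stops
> # growing for TIMEOUT seconds. All names and values here are placeholders.
> import os, subprocess, sys, time
>
> MDOUT = "mdout"      # output file to watch (placeholder)
> TIMEOUT = 1800       # seconds without new output before assuming a freeze
>
> proc = subprocess.Popen(["pmemd.cuda", "-O", "-i", "mdin", "-o", MDOUT,
>                          "-p", "prmtop", "-c", "inpcrd", "-r", "restrt"])
>
> last_size = -1
> last_change = time.time()
> while proc.poll() is None:
>     time.sleep(60)
>     size = os.path.getsize(MDOUT) if os.path.exists(MDOUT) else 0
>     if size != last_size:
>         last_size, last_change = size, time.time()
>     elif time.time() - last_change > TIMEOUT:
>         proc.kill()    # no output for TIMEOUT seconds: treat the run as frozen
>         sys.exit(1)
> sys.exit(proc.returncode)
>
> Watching the size of mdout seemed simpler than parsing it - a frozen run
> just stops writing - but I'd rather avoid this kind of babysitting entirely.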
>
> Thanks
>
> Sasha
>
>
>
>
> Ross Walker wrote:
>
>> Hi Peter
>>
>> I have seen something similar, but only on boot. Essentially, right after
>> booting into init 3 on RHEL5, regular users cannot see the CUDA cards.
>> However, if root first runs the deviceQuery command from the SDK, the
>> users can then see the cards.
>>
>> Beyond that, I would check to make sure you do not have any power
>> management enabled in the BIOS. Maybe something is causing it to power
>> down the cards.
>>
>> Next time they vanish, have root run the deviceQuery command and see if
>> they reappear.
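>>
>> If you want to automate that, something along the lines of the sketch
>> below, run once as root at boot (from rc.local, say), should be enough.
>> The deviceQuery and log paths are only examples - point them at your own
>> SDK install and preferred log location:
>>
>> # Boot-time sketch: run deviceQuery once as root so the /dev/nvidia*
>> # device nodes exist before regular users need the cards.
>> # The paths below are example placeholders.
>> import subprocess, sys
>>
>> DEVICE_QUERY = "/root/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery"
>>
>> with open("/var/log/devicequery_boot.log", "w") as log:
>>     ret = subprocess.call([DEVICE_QUERY], stdout=log, stderr=subprocess.STDOUT)
>> sys.exit(ret)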
>>
>> All the best
>> Ross
>>
>>
>>
>>> -----Original Message-----
>>> From: Peter Schmidtke [mailto:pschmidtke.mmb.pcb.ub.es]
>>> Sent: Tuesday, February 08, 2011 5:22 PM
>>> To: amber.ambermd.org
>>> Subject: Re: [AMBER] pmemd.cuda issues on Tesla S2050
>>>
>>> Dear Sasha, Ross & Scott,
>>>
>>> I'd like to second Sasha's report that the GPUs occasionally go
>>> unrecognized - I've seen these issues on blades with only 2 Teslas
>>> inside. Instead of rebooting, simply running Amber's CUDA tests seems
>>> to work wonders.
>>>
>>> Does anyone have an idea why? I'd also add that when pmemd.cuda does
>>> not recognize a Tesla, other CUDA-based tools (pycuda, for instance)
>>> don't recognize it either.
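>>>
>>> As a stopgap, a quick pre-job check along the lines of the sketch below
>>> (the expected GPU count is just a placeholder) might at least let the
>>> scheduler hold a bad node instead of wasting jobs on it:
>>>
>>> # Pre-job sanity check sketch: exit non-zero if the driver reports fewer
>>> # devices than this node is supposed to have. EXPECTED is a placeholder.
>>> import sys
>>> import pycuda.driver as drv
>>>
>>> EXPECTED = 2    # GPUs this node should have (placeholder)
>>>
>>> try:
>>>     drv.init()
>>>     found = drv.Device.count()
>>> except Exception as e:
>>>     print("CUDA initialization failed: %s" % e)
>>>     sys.exit(2)
>>>
>>> print("found %d of %d GPUs" % (found, EXPECTED))
>>> sys.exit(0 if found >= EXPECTED else 1)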
>>>
>>> Thanks.
>>>
>>> Peter
>>>
>>> On 02/09/2011 01:57 AM, Sasha Buzko wrote:
>>>
>>>
>>>> Hi Ross and Scott,
>>>> I've been running some heavy simulations on a cluster with 12 GPUs per
>>>> node (a custom build with dual port PCI-E cards on a board with 3 full
>>>> speed PCI-E slots + 1 IB). These are array jobs and run independently of
>>>> each other. The host systems are dual 6-core Intel chips (1 core per
>>>> GPU).
>>>>
>>>> We've all seen the issue of Amber11 simulations freezing on GTX4*/5*
>>>> series cards. It turns out that it happens on Teslas as well - far more
>>>> rarely, but it still does. The only reason I'm catching it is the sheer
>>>> number of simulations in my case. Each frozen process continues to
>>>> consume 100% of a core's capacity but stops producing any output. The
>>>> systems are not particularly big (about 20k atoms).
>>>> Can you think of any possible explanation for this behavior? Can it be
>>>> related to the issues you've been seeing with GTX cards?
>>>>
>>>> Another strange problem is disappearing GPUs - every once in a while a
>>>> node will stop recognizing the GPUs and will only see them after a
>>>> reboot. It would happen between consecutive jobs for no apparent
>>>> reason.
>>>>
>>>> It hasn't happened on my other cluster with 6 and 8 GPUs per node. Could
>>>> it be that the system is having trouble keeping track of all 12 cards
>>>> somehow? Can you think of a remedy for that?
>>>>
>>>> Once again, it only happens occasionally. The problem is that all
>>>> remaining jobs get funneled through such a failed node and are wasted
>>>> for lack of proper processing. This effectively kills the entire
>>>> remaining set of jobs.
>>>>
>>>>
>>>> Thanks in advance for any thoughts in this regard.
>>>>
>>>> Sasha
>>>>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Mar 02 2011 - 10:00:08 PST