Re: [AMBER] pmemd.cuda issues on Tesla S2050

From: Ross Walker <ross.rosswalker.co.uk>
Date: Tue, 8 Feb 2011 19:29:08 -0800

Hi Peter

I have seen something similar, but only at boot. On RHEL5, right after
booting into runlevel 3 (init 3), regular users cannot see the CUDA cards.
However, if root first runs the deviceQuery command from the SDK, all is
well and the users themselves can then see the cards.
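
If the SDK's deviceQuery binary is not to hand, a minimal stand-in along the
same lines can be knocked together - just a sketch, assuming the CUDA toolkit
and nvcc are installed (the file name check_gpus.cu is made up):

    /* check_gpus.cu - hypothetical deviceQuery-style check.
       Build with:  nvcc -o check_gpus check_gpus.cu
       and have root run it once after boot. */
    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void)
    {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess) {
            /* a "missing" GPU shows up here as an error */
            fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                    cudaGetErrorString(err));
            return 1;
        }
        printf("Found %d CUDA device(s)\n", count);
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            if (cudaGetDeviceProperties(&prop, i) == cudaSuccess)
                printf("  Device %d: %s (compute %d.%d)\n",
                       i, prop.name, prop.major, prop.minor);
        }
        return 0;
    }

Running something like this as root (for example from rc.local) should have
the same side effect as deviceQuery of initialising the devices so that
ordinary users can see them afterwards.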

Beyond that, I would check that you do not have any power management
enabled in the BIOS; maybe something is causing the cards to power down.

Next time they vanish, have root run the deviceQuery command and see
whether they reappear.

All the best
Ross

> -----Original Message-----
> From: Peter Schmidtke [mailto:pschmidtke.mmb.pcb.ub.es]
> Sent: Tuesday, February 08, 2011 5:22 PM
> To: amber.ambermd.org
> Subject: Re: [AMBER] pmemd.cuda issues on Tesla S2050
>
> Dear Sasha, Ross & Scott,
>
> I'd like to second Sasha's observation that the GPUs are not recognized
> once in a while; I have seen these issues on blades with only 2 Teslas
> inside. Instead of rebooting, it turned out that simply running the
> Amber CUDA tests can work miracles.
>
> Does anyone have an idea why? I'd like to add that if pmemd.cuda does
> not recognize a Tesla, other CUDA-based tools (PyCUDA, for instance)
> don't see it either.
>
> Thanks.
>
> Peter
>
> On 02/09/2011 01:57 AM, Sasha Buzko wrote:
> > Hi Ross and Scott,
> > I've been running some heavy simulations on a cluster with 12 GPUs per
> > node (a custom build with dual port PCI-E cards on a board with 3 full
> > speed PCI-E slots + 1 IB). These are array jobs and run independently of
> > each other. The host systems are dual 6-core Intel chips (1 core per
> > GPU).
> >
> > We've all seen the issue of Amber11 simulations freezing on GTX4*/5*
> > series cards. It turns out that it happens on Teslas as well - far more
> > rarely, but it still does. The only reason I'm picking up on it is the
> > sheer number of simulations in my case. Each frozen process continues to
> > consume 100% of a core, but stops producing any output. The systems are
> > not particularly big (about 20k atoms).
> > Can you think of any possible explanation for this behavior? Could it be
> > related to the issues you've been seeing with GTX cards?
> >
> > Another strange problem is disappearing GPUs - every once in a while a
> > node will stop recognizing the GPUs and will only see them after a
> > reboot. It would happen between consecutive jobs for no apparent
> > reason.
> > It hasn't happened on my other cluster with 6 and 8 GPUs per node. Could
> > it be that the system is having trouble keeping track of all 12 cards
> > somehow? Can you think of a remedy for that?
> >
> > Once again, it only happens occasionally. The problem is that all
> > remaining jobs get funneled through such a failed node and are wasted
> > for lack of proper processing. This effectively kills the entire
> > remaining set of jobs.
> >
> >
> > Thanks in advance for any thoughts in this regard.
> >
> > Sasha
> >


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Feb 08 2011 - 19:30:04 PST