Hi Sasha,
Wow, you really push the envelope. ;-)
The first issue is that, as far as I am aware, the current driver is only
designed to support 8 GPUs per OS instance, which probably explains the
disappearing GPUs. You might need to escalate that directly with NVIDIA's
driver people; there may be a beta driver that supports more than 8 GPUs. In
the meantime, I would be interested to know whether you see the same problem
when you run on a node with 6 or 8 GPUs. I.e. is the lockup occurring because
the driver is dropping one of the GPUs?
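As a quick sanity check between jobs you could run something like the small
CUDA program below on the suspect node (just a sketch of my own, not part of
AMBER; the file name and printout are made up). It simply asks the CUDA
runtime for the device count and properties, so comparing its output (or
nvidia-smi's) before and after a lockup would show whether the driver has
silently dropped one of the 12 cards:

    // check_gpus.cu : hypothetical helper, compile with: nvcc check_gpus.cu -o check_gpus
    // Lists the GPUs the CUDA runtime can currently see. If a card has
    // "disappeared", the count drops and the missing device is not listed.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int n = 0;
        cudaError_t err = cudaGetDeviceCount(&n);
        if (err != cudaSuccess) {
            printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("CUDA runtime sees %d device(s)\n", n);
        for (int i = 0; i < n; ++i) {
            cudaDeviceProp prop;
            if (cudaGetDeviceProperties(&prop, i) == cudaSuccess)
                printf("  device %d: %s (SM %d.%d, %zu MB)\n", i, prop.name,
                       prop.major, prop.minor,
                       (size_t)(prop.totalGlobalMem >> 20));
        }
        return 0;
    }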
All the best
Ross
> -----Original Message-----
> From: Sasha Buzko [mailto:obuzko.ucla.edu]
> Sent: Tuesday, February 08, 2011 4:58 PM
> To: AMBER Mailing List
> Subject: [AMBER] pmemd.cuda issues on Tesla S2050
>
> Hi Ross and Scott,
> I've been running some heavy simulations on a cluster with 12 GPUs per
> node (a custom build with dual port PCI-E cards on a board with 3 full
> speed PCI-E slots + 1 IB). These are array jobs and run independently of
> each other. The host systems are dual 6-core Intel chips (1 core per GPU).
>
> We've all seen the issue of freezing Amber11 simulations on GTX4*/5*
> series cards. It turns out that it happens on Tesla as well, far more
> rarely, but it still does. The only reason I'm noticing it is the number
> of simulations I'm running. Each frozen process continues to consume 100%
> of a core's capacity but stops producing any output. The systems are not
> particularly big (about 20k atoms).
> Can you think of any possible explanation for this behavior? Can it be
> related to the issues you've been seeing with GTX cards?
>
> Another strange problem is disappearing GPUs: every once in a while a
> node will stop recognizing the GPUs and will only see them again after a
> reboot. This happens between consecutive jobs for no apparent reason, and
> it hasn't happened on my other cluster with 6 and 8 GPUs per node. Could
> it be that the system is having trouble keeping track of all 12 cards
> somehow? Can you think of a remedy for that?
>
> Once again, this only happens occasionally. The problem is that all
> remaining jobs get funneled through the failed node and are wasted
> because they are never processed properly, which effectively kills the
> entire remaining set of jobs.
>
>
> Thanks in advance for any thoughts in this regard.
>
> Sasha
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber