Hi Sasha,
I will reply in more depth later, but the first issue is that, as far as I am aware, the current driver is only designed to support 8 GPUs per OS. That probably explains the disappearing GPUs. You might need to escalate this directly with NVIDIA's driver people; there may be a beta driver that supports more than 8 GPUs.
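In the meantime, one way to keep the remaining jobs from being funneled through a node that has lost its GPUs might be to check the device count before each job starts. Below is a rough sketch in Python, not a tested solution. It assumes `nvidia-smi -L` prints one `GPU N: ...` line per device on your driver version; `EXPECTED_GPUS` and the function names are placeholders of mine, not anything from Amber or NVIDIA.

```python
import re
import subprocess

# Assumption: the number of GPUs a healthy node should expose (12 in your build).
EXPECTED_GPUS = 12

# Assumption: `nvidia-smi -L` prints one line per device starting with "GPU <index>:".
_GPU_LINE = re.compile(r"^GPU \d+:", re.MULTILINE)


def count_gpus(listing: str) -> int:
    """Count the device lines in the text printed by `nvidia-smi -L`."""
    return len(_GPU_LINE.findall(listing))


def node_is_healthy(listing: str, expected: int = EXPECTED_GPUS) -> bool:
    """Return True only if the driver reports at least `expected` GPUs."""
    return count_gpus(listing) >= expected


def query_driver() -> str:
    """Return the device listing, or an empty string if the query fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=60
        )
    except (OSError, subprocess.TimeoutExpired):
        # A missing binary or a hung query is treated as "no GPUs visible".
        return ""
    return result.stdout if result.returncode == 0 else ""


def main() -> int:
    """Exit code for a pre-job check: 0 if all GPUs are visible, 1 otherwise."""
    return 0 if node_is_healthy(query_driver()) else 1
```

If your scheduler supports a per-job prolog, calling `sys.exit(main())` from it and taking the node offline on a non-zero exit would let the rest of the array run elsewhere instead of being wasted. How the node is taken offline depends on the scheduler, so that part is left out here.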
All the best
On Feb 8, 2011, at 16:57, Sasha Buzko <obuzko.ucla.edu> wrote:
> Hi Ross and Scott,
> I've been running some heavy simulations on a cluster with 12 GPUs per
> node (a custom build with dual port PCI-E cards on a board with 3 full
> speed PCI-E slots + 1 IB). These are array jobs and run independently of
> each other. The host systems are dual 6-core Intel chips (1 core per GPU).
> We've all seen the issue of freezing Amber11 simulations on GTX4*/5*
> series cards. It turns out that it happens on Tesla as well. Far more
> rarely, but it still does. The only reason I'm picking it up is the
> number of simulations in my case. Each frozen process continues to
> consume 100% of a core's capacity, but stops producing any output. The
> systems are not particularly big (about 20k atoms).
> Can you think of any possible explanation for this behavior? Can it be
> related to the issues you've been seeing with GTX cards?
> Another strange problem is disappearing GPUs - every once in a while a
> node will stop recognizing the GPUs and will only see them after a
> reboot. It would happen between consecutive jobs for no apparent reason.
> It hasn't happened on my other cluster with 6 and 8 GPUs per node. Could
> it be that the system is having trouble keeping track of all 12 cards
> somehow? Can you think of a remedy for that?
> Once again, it only happens occasionally. The problem is that all
> remaining jobs get funneled through such a failed node and are wasted
> due to the lack of proper processing. This effectively kills the entire
> remaining set of jobs.
> Thanks in advance for any thoughts in this regard.
> Sasha
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Feb 08 2011 - 17:30:04 PST