Hi Ross and Scott,
I've been running some heavy simulations on a cluster with 12 GPUs per
node (a custom build with dual-port PCI-E cards on a board with 3
full-speed PCI-E slots + 1 IB). These are array jobs and run
independently of each other. The host systems are dual 6-core Intel
chips (one core per GPU).
We've all seen the issue of Amber11 simulations freezing on GTX4*/5*
series cards. It turns out that it happens on Tesla cards as well, far
more rarely, but it still does. The only reason I'm picking it up is the
sheer number of simulations in my case. Each frozen process continues to
consume 100% of a core but stops producing any output. The systems are
not particularly big (about 20k atoms).
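To make "stops producing any output" concrete: a healthy run keeps
touching its mdinfo/mdout files, so the hung ones can be spotted by
output that has gone stale while the process still burns CPU. This is
just a rough sketch of how I would flag them; the path and the
30-minute threshold are placeholders, nothing Amber-specific:

    import os
    import time

    MDINFO = "/path/to/run/mdinfo"    # placeholder: per-job output file
    STALL_AFTER = 30 * 60             # placeholder: 30 min with no update

    def looks_hung(path=MDINFO, stall_after=STALL_AFTER):
        """True if the output file exists but has not been updated
        for longer than stall_after seconds."""
        try:
            age = time.time() - os.path.getmtime(path)
        except OSError:
            return False              # no file yet; the job may still be starting
        return age > stall_after

    if looks_hung():
        print("run looks hung: output is stale, candidate for kill/requeue")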
Can you think of any possible explanation for this behavior? Could it be
related to the issues you've been seeing with GTX cards?
Another strange problem is disappearing GPUs: every once in a while a
node will stop recognizing its GPUs and will only see them again after a
reboot. It happens between consecutive jobs for no apparent reason, and
it has not happened on my other cluster with 6 and 8 GPUs per node.
Could it be that the system is having trouble keeping track of all 12
cards somehow? Can you think of a remedy for that?
Once again, it only happens occasionally. The real problem is that all
remaining jobs in the array get funneled through such a failed node and
are wasted, since nothing is actually processed there. This effectively
kills the entire remaining set of jobs.
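The only work-around I can think of so far is a pre-flight check at the
top of the job script that refuses to run when the node sees fewer GPUs
than it should, so a bad node does not silently eat the rest of the
array. A rough sketch, assuming nvidia-smi is on the path and that
"nvidia-smi -L" prints one line per visible device; the count of 12 is
of course specific to these nodes:

    import subprocess
    import sys

    EXPECTED_GPUS = 12    # placeholder: GPUs this node type should expose

    def visible_gpus():
        """Count devices reported by 'nvidia-smi -L' (one per line)."""
        try:
            out = subprocess.run(["nvidia-smi", "-L"],
                                 capture_output=True, text=True, timeout=30)
        except (OSError, subprocess.TimeoutExpired):
            return 0      # driver not answering counts as a failure
        return len([line for line in out.stdout.splitlines() if line.strip()])

    if visible_gpus() < EXPECTED_GPUS:
        # Bail out before the scheduler funnels the whole array through this node.
        sys.exit("node is missing GPUs; refusing to run")

Whether the right reaction is to requeue the task or to take the node
offline is a scheduler question; the point is just not to let a GPU-less
node keep accepting work.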
Thanks in advance for any thoughts in this regard.
Sasha