Dear Sasha, Ross & Scott,
I can confirm Sasha's observation that GPUs are occasionally not
recognized; I have seen the same problem on blades with only 2 Teslas
inside. Instead of rebooting, it turned out that simply running the CUDA
tests of Amber can work miracles.
Does anyone have an idea why? I'd like to add that if pmemd.cuda does not
recognize a Tesla, other CUDA-based tools (pycuda, for instance) don't
either.
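
Since the failure affects every CUDA tool on the node and not just
pmemd.cuda, one possible guard is to check GPU visibility before each job
starts, so that a node which has lost its cards does not silently swallow
the rest of an array job. The following is only a minimal sketch, not
something taken from Amber. It assumes that `nvidia-smi -L` is available
on the node and prints one line per device beginning with "GPU "; the
function names and the `expected` count are hypothetical.

```python
import subprocess


def count_gpus(listing):
    """Count devices in text shaped like `nvidia-smi -L` output.

    Assumption: each visible device appears as one line starting with 'GPU '.
    """
    return sum(
        1 for line in listing.splitlines() if line.strip().startswith("GPU ")
    )


def node_has_all_gpus(expected, list_cmd=("nvidia-smi", "-L")):
    """Return True only if the listing command reports at least `expected` GPUs.

    `list_cmd` is a parameter so another tool can be substituted; any failure
    to run it (missing binary, timeout, non-zero exit) counts as unhealthy.
    """
    try:
        result = subprocess.run(
            list_cmd, capture_output=True, text=True, timeout=60
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    if result.returncode != 0:
        return False
    return count_gpus(result.stdout) >= expected
```

A job prolog could call `node_has_all_gpus(2)` on a 2-Tesla blade (or
`node_has_all_gpus(12)` on a 12-GPU node) and, when it returns False, take
the node out of the queue or try the workaround of running the Amber CUDA
tests before accepting further jobs.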
Thanks.
Peter
On 02/09/2011 01:57 AM, Sasha Buzko wrote:
> Hi Ross and Scott,
> I've been running some heavy simulations on a cluster with 12 GPUs per
> node (a custom build with dual port PCI-E cards on a board with 3 full
> speed PCI-E slots + 1 IB). These are array jobs and run independently of
> each other. The host systems are dual 6-core Intel chips (1 core per GPU).
>
> We've all seen the issue of freezing Amber11 simulations on GTX4*/5*
> series cards. It turns out that it happens on Tesla as well. Far more
> rarely, but it still does. The only reason I'm picking it up is the
> number of simulations in my case. Each frozen process continues to
> consume 100% of a core's capacity, but stops producing any output. The
> systems are not particularly big (about 20k atoms).
> Can you think of any possible explanation for this behavior? Can it be
> related to the issues you've been seeing with GTX cards?
>
> Another strange problem is disappearing GPUs - every once in a while a
> node will stop recognizing the GPUs and will only see them after a
> reboot. It would happen between consecutive jobs for no apparent reason.
> It hasn't happened on my other cluster with 6 and 8 GPUs per node. Could
> it be that the system is having trouble keeping track of all 12 cards
> somehow? Can you think of a remedy for that?
>
> Once again, it only happens occasionally. The problem is that all
> remaining jobs get funneled through such a failed node and are wasted
> because they cannot be processed properly. This effectively kills the
> entire remaining set of jobs.
>
>
> Thanks in advance for any thoughts in this regard.
>
> Sasha
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>
Received on Tue Feb 08 2011 - 17:30:03 PST