Hi All,
We have a small cluster with NVIDIA K20X GPUs on which we run Amber
jobs via Slurm. From time to time, the Amber jobs that land on a
particular node all start failing, and that node turns into a black
hole. The error reported is "Error selecting compatible GPU: all
CUDA-capable devices are busy or unavailable". In most cases we have
to drain the node and reboot it to get things working again. This may
not be an Amber/pmemd.cuda-related error, but I wanted to ask on this
list in case others have seen the issue and, if so, how it can be
identified proactively. I don't see anything in the Slurm scheduler
that can help us with this, so I'm wondering whether there is some way,
within Amber or NVIDIA's driver, to catch or resolve the problem.
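
In case a concrete example helps frame the question: one thing we
could imagine is a small standalone CUDA probe run from a Slurm prolog
or node health check that tries to create a context on each device,
which is more or less what pmemd.cuda appears to do when it selects a
GPU, and returns non-zero so the node can be drained. This is only a
rough sketch of the idea, not anything Amber or Slurm provides; the
file name, allocation size, and exit codes are placeholders.

/* gpu_probe.cu -- rough sketch of a per-node GPU health probe.
 * Compile with: nvcc -o gpu_probe gpu_probe.cu
 * Exit 0 if every device is usable, non-zero otherwise so a
 * prolog/health-check script could drain the node. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        err = cudaSetDevice(i);
        if (err == cudaSuccess) {
            void *p = NULL;
            /* Small allocation to force context creation on the device. */
            err = cudaMalloc(&p, 1 << 20);
            if (err == cudaSuccess)
                cudaFree(p);
        }
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d unusable: %s\n", i,
                    cudaGetErrorString(err));
            ++bad;
        }
        cudaDeviceReset();
    }
    return bad ? 2 : 0;
}

If something along those lines (or an existing tool that does the same
job) is the usual answer, pointers would be much appreciated.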
Thanks for your help with this.
-J
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed May 13 2015 - 15:30:02 PDT