Re: [AMBER] Error selecting compatible GPU all CUDA-capable devices are busy or unavailable

From: Jason Swails <jason.swails.gmail.com>
Date: Thu, 14 May 2015 10:14:08 -0400

On Wed, May 13, 2015 at 6:02 PM, Jagga Soorma <jagga13.gmail.com> wrote:

> Hi All,
>
> We have a small cluster with some NVIDIA K20X GPUs that we run Amber
> jobs on via Slurm. From time to time, a bunch of Amber jobs that land
> on a specific node start failing, which creates a black hole. The
> error message reported is "Error selecting compatible GPU all
> CUDA-capable devices are busy or unavailable". In most cases we have
> to drain the node and reboot it to get things back. This might not be
> an Amber/pmemd.cuda-related error, but I wanted to ask on this list in
> case others have seen this issue and, if so, how it can be identified
> proactively. I don't see anything in the Slurm scheduler that can help
> us with this, so I am wondering if there is some way within Amber or
> NVIDIA's driver to solve this issue.
>

Amber doesn't have anything like this. Do you know what's causing the
problem? Perhaps a ghost process that is not correctly killed if a job
runs over its wallclock?
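
One way to test the ghost-process idea is to compare what the driver reports
as running on the GPUs with the processes that belong to active jobs. The
following is only a minimal sketch, not something Amber or Slurm provides.
It assumes nvidia-smi is on the PATH of the compute node, and that you can
build the set of PIDs belonging to currently running jobs yourself (the
job_pids argument is hypothetical input, as are the function names).

```python
import subprocess

# Standard nvidia-smi query for compute processes, one CSV row per process.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name",
    "--format=csv,noheader",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name' rows into a list of (pid, name) tuples."""
    apps = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        pid_field, _, name = line.partition(",")
        try:
            pid = int(pid_field.strip())
        except ValueError:
            continue  # skip any line that is not a process row
        apps.append((pid, name.strip()))
    return apps


def find_stray_processes(apps, job_pids):
    """Return GPU processes whose PID is not in the set of known job PIDs.

    job_pids is an assumption: a set of PIDs you collected for the jobs
    that are legitimately running on this node.
    """
    return [(pid, name) for pid, name in apps if pid not in job_pids]


def list_gpu_processes():
    """Run nvidia-smi on the local node and return the parsed processes."""
    result = subprocess.run(
        NVIDIA_SMI_CMD, capture_output=True, text=True, check=True
    )
    return parse_compute_apps(result.stdout)
```

Calling find_stray_processes(list_gpu_processes(), job_pids) on a node that
should be idle (job_pids empty) would list anything still holding a GPU,
which could point to a leftover pmemd.cuda from a job that was not cleaned
up properly.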

-- 
Jason M. Swails
BioMaPS,
Rutgers University
Postdoctoral Researcher
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu May 14 2015 - 07:30:02 PDT