Re: [AMBER] Run Amber on a system with multiple GPUs

From: Sasha Buzko <obuzko.ucla.edu>
Date: Fri, 02 Sep 2011 16:27:21 -0700

Hi Peter,
we have a similar situation and use a workaround: some extra code added to
the scripts that SGE runs. Maybe there is a more elegant solution, but this
one works pretty well.

Each server has a per-GPU scratch directory for actual job execution and
output, and a world-writable /cuda directory for lock files corresponding
to each GPU (cuda.0, cuda.1, etc.). The code at the beginning of each job
script walks through the GPU ids: if the scratch directory for an id
already exists, it moves on to the next id; if it doesn't, the script tries
to create it. If the creation fails (another process created it first), it
skips to the next id, until it runs out of ids. Once $device is set, this
variable is used in the command that launches Amber (..... -gpu $device).
At the end of each job the lock file is deleted, and a new job can step in.
The number of GPUs obviously has to equal the number of slots assigned to
the server by the SGE queue.
Below is the relevant portion of an individual task script from an array
job run by SGE. Here the total is 12 GPUs, so the maximum index is 11.
Strictly speaking, the /cuda/cuda.* lock files are not necessary, since
creating the scratch directory is the real lock; they stayed in the script
for historical reasons.

Hope it helps

Sasha



#!/bin/bash
# Claim a free GPU: the per-GPU scratch directory doubles as the lock,
# because mkdir is atomic and fails if the directory already exists.
device=-1
total=11    # 12 GPUs, ids 0..11
for ((i = 0; i <= total; i++)); do
    # Skip ids whose scratch directory (lock) already exists.
    if [ -d "/scratch/sasha/cuda.$i" ]; then
        continue
    fi
    # Try to take the lock; if two jobs race, only one mkdir succeeds
    # and the loser moves on to the next id.
    if mkdir "/scratch/sasha/cuda.$i" 2>/dev/null; then
        device=$i
        touch "/cuda/cuda.$device"    # legacy lock file, kept for historical reasons
        break
    fi
done
# No free GPU id left: give up.
if [ "$device" = "-1" ]; then
    exit 1
fi
cd "/scratch/sasha/cuda.$device"
echo "Running on node $(hostname), device $device"

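Submission is then an ordinary SGE array job, one task per GPU slot, e.g.
(script name hypothetical):

# 12 tasks, matching the 12 GPU slots assigned to the server.
qsub -t 1-12 run_amber_array.sh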

peter.stauffert.boehringer-ingelheim.com wrote:
> Hi All,
>
> we are running multiple pmemd.cuda jobs on a system with multiple nVidia 2050
> GPUs; the GPUs are set to exclusive mode with nvidia-smi.
> Of course, we can force a job to run on a specific GPU by 'pmemd -gpu n' or
> by setting the environment variable CUDA_VISIBLE_DEVICES="n".
> But unfortunately our scheduling system (SGE) does not know which GPU is in
> use; to SGE, a GPU is simply a consumable resource like a license.
> If no GPU number is specified, pmemd.cuda always uses the GPU with the
> highest device number, even if that GPU is busy, and fails with the error:
> cudaMemcpyToSymbol: SetSim copy to cSim failed all CUDA-capable devices are
> busy or unavailable
>
> Is there an option to force pmemd.cuda to use only idle GPUs?
> Or is there a tool to test whether a GPU is in use by another process?
>
> Of course, we can start pmemd.cuda and, when it stops with the error message
> above, restart it with a lower GPU number, but this is obviously not a good
> solution.
>
> Kind regards,
>
> Peter
>
> Dr. Peter Stauffert
> Boehringer Ingelheim Pharma GmbH & Co. KG
> mailto:peter.stauffert.boehringer-ingelheim.com
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Sep 02 2011 - 17:00:02 PDT