Re: [AMBER] Chained AMBER jobs crash on Dual GPU compute node.

From: Jan-Philip Gehrcke <jgehrcke.googlemail.com>
Date: Sun, 26 May 2013 20:30:10 +0200

From your description, I am not sure whether you actually use the
environment variable PBS_GPUFILE in your jobs. For each GPU job, Torque
writes the ID of the GPU the job *should* use to a file (this ID has the
form 'hostname-gpuX'). The path to this file is the value of the
environment variable PBS_GPUFILE. PBS_GPUFILE is the *only* way your job
can find out which GPU device it should run on.

In your job program, you have to make sure that you actually use the GPU
device that Torque wants you to use.

Hence, within your job, you must evaluate PBS_GPUFILE, read the file it
points to and set CUDA_VISIBLE_DEVICES accordingly. Since CUDA as well
as Torque start counting GPUs from 0, you can simply map 'hostname-gpuX'
to CUDA_VISIBLE_DEVICES=X.
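
A minimal sketch of that mapping for a bash job script. It assumes that
every line in the file ends in '-gpuX' (one line per assigned GPU); the
helper name gpu_ids_from_file is made up for illustration and is not
part of Torque or CUDA.

```shell
#!/bin/bash
# Sketch: derive CUDA_VISIBLE_DEVICES from the file named by PBS_GPUFILE.
# Assumption: each line of that file has the form 'hostname-gpuX'.

# Hypothetical helper: print the GPU indices found in the given file,
# comma-separated (one index per matching line).
gpu_ids_from_file() {
    sed -n 's/.*-gpu\([0-9][0-9]*\)$/\1/p' "$1" | paste -sd, -
}

# Only touch CUDA_VISIBLE_DEVICES when Torque actually provided a GPU file.
if [ -n "${PBS_GPUFILE:-}" ] && [ -r "${PBS_GPUFILE}" ]; then
    CUDA_VISIBLE_DEVICES=$(gpu_ids_from_file "${PBS_GPUFILE}")
    export CUDA_VISIBLE_DEVICES
    echo "Using GPU(s): ${CUDA_VISIBLE_DEVICES}"
fi

# ... start the GPU program here ...
```

Put this at the top of each job script (job1.sh, job2.sh, ...) before
the GPU program is started, and drop any hard-coded
CUDA_VISIBLE_DEVICES setting.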

Hope this helps,

Jan-Philip


On 26.05.2013 09:52, ET wrote:
> Hi,
>
> I was hoping that someone might have some experience with this queuing
> system and may be able to offer some advice.
>
> I'm running Torque/PBS on a very simple setup of 2 Nvidia GPU cards on a
> single computer that acts as both the server and the compute node. The GPUs
> are set to exclusive mode and I include the command:
>
> #PBS -l nodes=1:ppn=1:gpus=1:exclusive_process
>
> I set up a series of jobs with the commands:
>
>
> TS_TASKid=`qsub -d ${absolDIR} ./job1.sh `
>
> TS_TASKid=`qsub -W depend=afterany:${TS_TASKid} -d ${absolDIR} job2.sh`
>
> TS_TASKid=`qsub -W depend=afterany:${TS_TASKid} -d ${absolDIR} job3.sh`
>
> If I have one chain of dependent jobs, I have no issues and everything
> works fine. However, if I have two chains of dependent jobs, things are OK
> for a while, then one chain crashes as Torque tries to submit a job to a
> GPU that already has a job.
>
> Is there any way around this? I tried setting each separate queue with a
> specific value, e.g.
>
> export CUDA_VISIBLE_DEVICES="0" # chain 1 of jobs
> export CUDA_VISIBLE_DEVICES="1" # chain 2 of jobs
>
> However, this does not work as I guess Torque/PBS uses its own internal
> method for assigning which GPU gets the job. I've searched the web and
> manual, but have not found anything that really works to deal with this
> issue.
>
> Any help or pointers on anything that I have missed would be greatly
> appreciated.
>
> br,
> g
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>


Received on Sun May 26 2013 - 12:00:03 PDT