[AMBER] Chained AMBER jobs crash on Dual GPU compute node.

From: ET <sketchfoot.gmail.com>
Date: Sun, 26 May 2013 08:52:54 +0100

Hi,

I was hoping that someone might have some experience with this queuing
system and may be able to offer some advice.

I'm running Torque/PBS in a very simple setup: 2 Nvidia GPU cards on a
single machine that acts as both the server and the compute node. The GPUs
are set to exclusive mode, and each job script includes the directive:

#PBS -l nodes=1:ppn=1:gpus=1:exclusive_process
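
For reference, each jobN.sh is roughly of this shape (the file names and
pmemd.cuda options here are illustrative, not my exact script):

#!/bin/bash
#PBS -l nodes=1:ppn=1:gpus=1:exclusive_process
# qsub -d already sets the working directory, so no cd is needed here
$AMBERHOME/bin/pmemd.cuda -O -i md.in -p prmtop -c inpcrd -o md.out -r restrt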

I set up a chain of dependent jobs with the commands:

TS_TASKid=`qsub -d ${absolDIR} ./job1.sh`
TS_TASKid=`qsub -W depend=afterany:${TS_TASKid} -d ${absolDIR} job2.sh`
TS_TASKid=`qsub -W depend=afterany:${TS_TASKid} -d ${absolDIR} job3.sh`

If I have one chain of dependent jobs, everything works fine. However, if
I have two chains of dependent jobs running at once, things are OK for a
while, then one chain crashes because Torque submits a job to a GPU that
is already running a job.
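
The second chain is built in exactly the same way, just tracked in its own
variable (the job names here are illustrative):

TS2_TASKid=`qsub -d ${absolDIR} ./job4.sh`
TS2_TASKid=`qsub -W depend=afterany:${TS2_TASKid} -d ${absolDIR} job5.sh`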

Is there any way around this? I tried pinning each chain to a specific GPU
by exporting CUDA_VISIBLE_DEVICES in its job scripts, e.g.:

export CUDA_VISIBLE_DEVICES="0" # chain 1 of jobs
export CUDA_VISIBLE_DEVICES="1" # chain 2 of jobs

However, this does not work; I guess Torque/PBS uses its own internal
method for deciding which GPU gets the job. I've searched the web and the
manual, but have not found anything that really deals with this issue.
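
The closest lead I have found is that Torque apparently records its GPU
assignment in the file named by $PBS_GPUFILE, one line per assigned GPU of
the form <hostname>-gpu<N>, so in principle each job script could read that
and set CUDA_VISIBLE_DEVICES to match. A sketch of what I mean (untested,
and the file format is just my reading of the Torque docs):

# inside each job script, before launching pmemd.cuda
if [ -n "$PBS_GPUFILE" ]; then
    # turn lines like "mynode-gpu1" into a comma-separated list of indices
    gpus=$(sed 's/.*gpu//' "$PBS_GPUFILE" | sort -un | paste -sd, -)
    export CUDA_VISIBLE_DEVICES="$gpus"
fi

I'm not sure how this interacts with exclusive_process mode, though, so if
anyone has done this I'd be glad to hear how.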

Any help, or pointers to anything I have missed, would be greatly
appreciated.

br,
g
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber