Re: [AMBER] Chained AMBER jobs crash on Dual GPU compute node.

From: Jan-Philip Gehrcke <jgehrcke.googlemail.com>
Date: Mon, 27 May 2013 18:04:15 +0200

Hello!

On 27.05.2013 12:41, ET wrote:
> Thanks for your quick reply, Jan-Philip.
>
> Regarding your comments: I was aware of $PBS_GPUFILE and had set the
> variable in my .bashrc. However, I could never find the text file in the
> location I had set.

Setting this environment variable in any of your scripts does not make
any sense at all. Torque sets this variable when it allocates a GPU job
to a specific node. Your job then needs to 1) read the environment
variable and 2) read the file it points to, in order to find out which
GPU device it should use. Please make sure that you understand what I am
saying here :-).
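
For illustration, here is a minimal sketch of what that could look like at
the top of a job script. It assumes $PBS_GPUFILE points to a file with a
single 'hostname-gpuX' line (as described further down); the sed call is
just one way to pull out the numeric device ID:

    # Minimal sketch: map Torque's GPU assignment to CUDA_VISIBLE_DEVICES.
    # Assumes the file contains one 'hostname-gpuX' entry for this job.
    GPU_ID=$(sed -e 's/.*gpu//' "$PBS_GPUFILE")
    export CUDA_VISIBLE_DEVICES=$GPU_ID
    echo "Torque assigned GPU $CUDA_VISIBLE_DEVICES on $(hostname)"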

> Pardon my ignorance, but is this necessary to run
> the jobs successfully with Torque?

Generally, yes (at least that's the way I understood things -- correct
me if I am wrong). However, there are scenarios when things seem to work
"by accident", e.g. when you have only one GPU per node.

> I.e., it is not necessary unless I want the
> granularity of assigning a specific GPU to a specific chain? In the
> end, it seems like the jobs were getting placed OK, but failed only
> because the faulty GPU was interfering with the process.
>
> Thanks again!
>
> br,
> g
>
> On 26 May 2013 19:30, Jan-Philip Gehrcke <jgehrcke.googlemail.com> wrote:
>
> From your description, I am not sure whether you actually use the
> environment variable PBS_GPUFILE in your jobs. For each GPU job, Torque
> writes the ID of the GPU the job *should* use to a file (this ID has the
> form 'hostname-gpuX'). The path to this file is the value of the
> environment variable PBS_GPUFILE. PBS_GPUFILE is the *only* way your job
> can find out which GPU device it should run on.
>
> In your job program, you have to make sure that you actually use the GPU
> device that Torque wants you to use.
>
> Hence, within your job, you must evaluate PBS_GPUFILE, read the file it
> points to and set CUDA_VISIBLE_DEVICES accordingly. Since CUDA as well
> as Torque start counting GPUs from 0, you can simply map 'hostname-gpuX'
> to CUDA_VISIBLE_DEVICES=X.
>
> Hope this helps,
>
> Jan-Philip
>
>
> On 26.05.2013 09:52, ET wrote:
> > Hi,
> >
> > I was hoping that someone might have some experience with this queuing
> > system and may be able to offer some advice.
> >
> > I'm running torque/PBS in a very simple setup with 2 Nvidia GPU cards on
> > a single computer that acts as both the server and the compute node. The
> > GPUs are set to exclusive mode and I include the command:
> >
> > #PBS -l nodes=1:ppn=1:gpus=1:exclusive_process
> >
> > I set up a series of jobs with the commands:
> >
> >
> > TS_TASKid=`qsub -d ${absolDIR} ./job1.sh`
> >
> > TS_TASKid=`qsub -W depend=afterany:${TS_TASKid} -d ${absolDIR} job2.sh`
> >
> > TS_TASKid=`qsub -W depend=afterany:${TS_TASKid} -d ${absolDIR} job3.sh`
> >
> > If I have one chain of dependent jobs, I have no issues and everything
> > works fine. However, if I have two chains of dependent jobs, things are
> > OK for a while, then one chain crashes as Torque tries to submit a job
> > to a GPU that already has a job.
> >
> > Is there any way around this? I tried setting each separate queue with
> > a specific value, e.g.
> >
> > export CUDA_VISIBLE_DEVICES="0" # chain 1 of jobs
> > export CUDA_VISIBLE_DEVICES="1" # chain 2 of jobs
> >
> > However, this does not work as I guess Torque/PBS uses its own internal
> > method for assigning which GPU gets the job. I've searched the web and
> > manual, but have not found anything that really works to deal with this
> > issue.
> >
> > Any help or pointers on anything that I have missed would be greatly
> > appreciated.
> >
> > br,
> > g


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon May 27 2013 - 09:30:02 PDT