Re: [AMBER] Chained AMBER jobs crash on Dual GPU compute node.

From: ET <sketchfoot.gmail.com>
Date: Tue, 28 May 2013 10:51:37 +0100

Hi,

Thanks for the information! :) Once the replacement card comes back, I'll
implement something that takes advantage of $PBS_GPUFILE as you suggested,
and if I find anything different, I'll let you know.

br,
g


On 27 May 2013 17:04, Jan-Philip Gehrcke <jgehrcke.googlemail.com> wrote:

> Hello!
>
>
> On 27.05.2013 12:41, ET wrote:
>
>> Thanks for your quick reply Jan-Phillip,
>>
>> Regarding your comments. I was aware of the $PBS_GPUFILE and had set the
>> variable in my .bashrc. However, I could never find the text file in the
>> location I had set.
>>
>
> Setting this environment variable in any of your scripts does not make
> any sense at all: Torque sets this variable when it allocates a GPU job
> to a specific node. The job program is then required to 1) read the
> environment variable and 2) read the file it points to in order to find
> out which GPU device it should use. Please make sure that you understand
> what I am saying here :-).
>
>
>> Pardon my ignorance, but is this a necessity to run
>> the jobs successfully with TORQUE?
>>
>
> Generally, yes (at least that's the way I understood things -- correct me
> if I am wrong). However, there are scenarios when things seem to work "by
> accident", e.g. when you have only one GPU per node.
>
>> i.e. not necessary unless I want the
>> granularity of specifying a specific GPU to a specific chain? In the
>> end, it seems like the jobs were getting placed OK, but failed only
>> because the faulty GPU was interfering with the process.
>>
>> Thanks again!
>>
>> br,
>> g
>>
>> On 26 May 2013 19:30, Jan-Philip Gehrcke <jgehrcke.googlemail.com> wrote:
>>
>> From your description, I am not sure whether you actually use the
>> environment variable PBS_GPUFILE in your jobs. For each GPU job, Torque
>> writes the ID of the GPU the job *should* use to a file (this ID has the
>> form 'hostname-gpuX'). The path to this file is the value of the
>> environment variable PBS_GPUFILE. PBS_GPUFILE is the *only* way your job
>> can find out which GPU device it should run on.
>>
>> In your job program, you have to make sure that you actually use the GPU
>> device that Torque wants you to use.
>>
>> Hence, within your job, you must evaluate PBS_GPUFILE, read the file it
>> points to, and set CUDA_VISIBLE_DEVICES accordingly. Since CUDA as well
>> as Torque start counting GPUs from 0, you can simply map 'hostname-gpuX'
>> to CUDA_VISIBLE_DEVICES=X.
>>
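>> A minimal, untested sketch of what that could look like at the top of
>> the job script, assuming the file referenced by PBS_GPUFILE contains a
>> single line of the form 'hostname-gpuX':
>>
>> # Strip everything up to and including '-gpu' to get the device number,
>> # then restrict CUDA to that one device.
>> gpu_id=$(sed -e 's/.*-gpu//' "$PBS_GPUFILE")
>> export CUDA_VISIBLE_DEVICES=$gpu_id
>>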
>> Hope this helps,
>>
>> Jan-Philip
>>
>>
>> On 26.05.2013 09:52, ET wrote:
>> > Hi,
>> >
>> > I was hoping that someone might have some experience with this
>> > queuing system and may be able to offer some advice.
>> >
>> > I'm running Torque/PBS on a very simple setup: 2 Nvidia GPU cards on
>> > a single computer that acts as both the server and the compute node.
>> > The GPUs are set to exclusive mode and I include the command:
>> >
>> > #PBS -l nodes=1:ppn=1:gpus=1:exclusive_process
>> >
>> > I set up a series of jobs with the commands:
>> >
>> >
>> > TS_TASKid=`qsub -d ${absolDIR} ./job1.sh`
>> >
>> > TS_TASKid=`qsub -W depend=afterany:${TS_TASKid} -d ${absolDIR} job2.sh`
>> >
>> > TS_TASKid=`qsub -W depend=afterany:${TS_TASKid} -d ${absolDIR} job3.sh`
>> >
>> > If I have one chain of dependent jobs, I have no issues and
>> > everything works fine. However, if I have two chains of dependent
>> > jobs, things are OK for a while, then one chain crashes as Torque
>> > tries to submit a job to a GPU that already has a job.
>> >
>> > Is there any way around this? I tried setting each separate queue
>> > with a specific value, e.g.
>> >
>> > export CUDA_VISIBLE_DEVICES="0" # chain 1 of jobs
>> > export CUDA_VISIBLE_DEVICES="1" # chain 2 of jobs
>> >
>> > However, this does not work, as I guess Torque/PBS uses its own
>> > internal method for assigning which GPU gets the job. I've searched
>> > the web and the manual, but have not found anything that really
>> > deals with this issue.
>> >
>> > Any help or pointers on anything that I have missed would be
>> > greatly appreciated.
>> >
>> > br,
>> > g
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue May 28 2013 - 03:00:02 PDT