Re: [AMBER] Chained AMBER jobs crash on Dual GPU compute node.

From: ET <sketchfoot.gmail.com>
Date: Mon, 27 May 2013 11:41:40 +0100

Thanks for your quick reply, Jan-Philip.

After further investigation, I realised that the error I was getting,

    ERROR: max pairlist cutoff must be less than unit cell max
    sphere radius!

was caused by a faulty GPU. The queues were crashing because both GPUs
were dipping into each other's chain of jobs, depending on which card was
free at the time. Whenever the faulty card processed a segment of the
simulation, it would write a series of ********* or NaNs to the output.
So it would first mess up the chain it had been assigned to at the start,
generally around the constant-pressure (NPT) stage of the simulation.
Then, as its own chain had disappeared due to the error, it would dip into
the other chain, which was now more available, do its little trick, and
ruin that chain of jobs too!

I verified that this was the issue by queuing the jobs from bash,
bypassing TORQUE/PBS, and getting each card to process the simulation on
its own. The faulty card consistently failed, whilst the other card had no
issues with the same simulation. So I've just RMA'd the card and hopefully
should get a new one soon.
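
For anyone who hits the same thing, something along these lines is enough
to pin a test run to one card from a plain bash session (just a sketch;
the pmemd.cuda input/output file names are placeholders, not my actual
files):

    # test card 0 in isolation, bypassing TORQUE/PBS
    export CUDA_VISIBLE_DEVICES=0
    $AMBERHOME/bin/pmemd.cuda -O -i md.in -p system.prmtop \
        -c system.inpcrd -o test_gpu0.out -r test_gpu0.rst

    # then repeat with CUDA_VISIBLE_DEVICES=1 to check the other card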

Regarding your comments: I was aware of $PBS_GPUFILE and had set the
variable in my .bashrc. However, I could never find the text file in the
location I had set. Pardon my ignorance, but is this a necessity for
running the jobs successfully with TORQUE, or is it only needed if I want
the granularity of assigning a specific GPU to a specific chain? In the
end, it seems like the jobs were getting placed OK, and failed only
because the faulty GPU was interfering with the process.
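
In case it is useful to anyone searching the archives later, my
understanding of the mapping you describe below would look roughly like
this at the top of a job script (a sketch only, assuming the file named by
PBS_GPUFILE holds a single line of the form 'hostname-gpuX'):

    # read the device id Torque assigned to this job (e.g. 'node01-gpu1' -> '1')
    gpu_id=$(sed 's/.*gpu//' "$PBS_GPUFILE")
    export CUDA_VISIBLE_DEVICES="$gpu_id"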

Thanks again!

br,
g

On 26 May 2013 19:30, Jan-Philip Gehrcke <jgehrcke.googlemail.com> wrote:

> From your description, I am not sure whether you actually use the
> environment variable PBS_GPUFILE in your jobs. For each GPU job, Torque
> writes the ID of the GPU the job *should* use to a file (this ID has the
> form 'hostname-gpuX'). The path to this file is the value of the
> environment variable PBS_GPUFILE. PBS_GPUFILE is the *only* way your job
> can find out which GPU device it should run on.
>
> In your job program, you have to make sure that you actually use the GPU
> device that Torque wants you to use.
>
> Hence, within your job, you must evaluate PBS_GPUFILE, read the file it
> points to and set CUDA_VISIBLE_DEVICES accordingly. Since CUDA as well
> as Torque start counting GPUs from 0, you can simply map 'hostname-gpuX'
> to CUDA_VISIBLE_DEVICES=X.
>
> Hope this helps,
>
> Jan-Philip
>
>
> On 26.05.2013 09:52, ET wrote:
> > Hi,
> >
> > I was hoping that someone might have some experience with this queuing
> > system and may be able to offer some advice.
> >
> > I'm running Torque/PBS on a very simple setup: 2 Nvidia GPU cards in a
> > single computer that acts as both the server and the compute node. The
> > GPUs are set to exclusive mode, and I include the command:
> >
> > #PBS -l nodes=1:ppn=1:gpus=1:exclusive_process
> >
> > I set up a series of jobs with the commands:
> >
> >
> > TS_TASKid=`qsub -d ${absolDIR} ./job1.sh `
> >
> > TS_TASKid=`qsub -W depend=afterany:${TS_TASKid} -d ${absolDIR} job2.sh`
> >
> > TS_TASKid=`qsub -W depend=afterany:${TS_TASKid} -d ${absolDIR} job3.sh`
> >
> > If I have one chain of dependent jobs, I have no issues and everything
> > works fine. However, if I have two chains of dependent jobs, things are
> > OK for a while, then one chain crashes as Torque tries to submit a job
> > to a GPU that already has a job.
> >
> > Is there any way around this? I tried setting a specific value for each
> > separate chain, e.g.
> >
> > export CUDA_VISIBLE_DEVICES="0" # chain 1 of jobs
> > export CUDA_VISIBLE_DEVICES="1" # chain 2 of jobs
> >
> > However, this does not work, as I guess Torque/PBS uses its own
> > internal method for assigning which GPU gets the job. I've searched the
> > web and the manual, but have not found anything that really deals with
> > this issue.
> >
> > Any help or pointers on anything that I have missed would be greatly
> > appreciated.
> >
> > br,
> > g
>
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon May 27 2013 - 04:00:02 PDT