Re: [AMBER] Chained AMBER jobs crash on Dual GPU compute node.

From: Jason Swails <jason.swails.gmail.com>
Date: Wed, 29 May 2013 10:15:33 -0400

On Mon, May 27, 2013 at 12:04 PM, Jan-Philip Gehrcke <
jgehrcke.googlemail.com> wrote:

> Hello!
>
> On 27.05.2013 12:41, ET wrote:
> > Thanks for your quick reply Jan-Philip,
> >
> > Regarding your comments. I was aware of the $PBS_GPUFILE and had set the
> > variable in my .bashrc. However, I could never find the text file in the
> > location I had set.
>
> Setting this environment variable in any of your scripts does not make
> any sense at all. Torque sets this variable when it allocates a GPU job
> to a specific node. The job program is then required to 1) read the
> environment variable and 2) read the file it points to in order to
> find out which GPU device it should use. Please make sure that you
> understand what I am saying here :-).
>
> > Pardon my ignorance, but is this a necessity to run
> > the jobs successfully with TORQUE?
>
> Generally, yes (at least that's the way I understood things -- correct
> me if I am wrong). However, there are scenarios when things seem to work
> "by accident", e.g. when you have only one GPU per node.
>

I agree that, out of courtesy to other users, you should use PBS_GPUFILE if
multiple jobs can be dispatched to the same node and some nodes have
multiple GPUs available. If you use every GPU on the node (and you request
a full node via the PBS resource list), then there is no general need to
lock specific threads to specific GPUs.

However, if you have a small local cluster that your group is _just_
running Amber on, either _everyone_ should respect PBS_GPUFILE or nobody
should. If nobody respects it, then the Amber jobs will run where they
will and Torque will make sure the GPUs are not over-subscribed in general.
If everyone respects it, the jobs will run where Torque specified they can
run and everybody's happy. If some respect it and some don't, then there
may be clashes between where the system decides to put certain jobs and
where Torque tries to put them, inevitably leading to conflicts.

(The above analysis comes from experience with such a cluster)
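
For anyone wondering what "respecting PBS_GPUFILE" could look like in
practice, here is a minimal Python sketch of the two steps Jan-Philip
describes (read the variable, then read the file it points to). It assumes
the file holds one entry per line in the form <hostname>-gpu<index>; please
check that against what Torque actually writes on your cluster. The
function name gpus_from_pbs_gpufile is made up for illustration. The result
is exported as CUDA_VISIBLE_DEVICES, which CUDA programs honor.

```python
import os
import re


def gpus_from_pbs_gpufile(text, hostname=None):
    """Return the sorted GPU indices listed in PBS_GPUFILE content.

    Assumption: entries look like '<hostname>-gpu<index>', one per line.
    If hostname is given, only entries for that host are returned.
    """
    indices = []
    for line in text.splitlines():
        match = re.fullmatch(r"(.+)-gpu(\d+)", line.strip())
        if match is None:
            continue  # skip blank or unrecognized lines
        host, index = match.group(1), int(match.group(2))
        if hostname is None or host == hostname:
            indices.append(index)
    return sorted(set(indices))


def main():
    # Step 1: read the environment variable that Torque sets for the job.
    gpufile = os.environ.get("PBS_GPUFILE")
    if gpufile is None:
        print("PBS_GPUFILE is not set; probably not inside a Torque GPU job")
        return
    # Step 2: read the file it points to and extract the GPU indices.
    with open(gpufile) as handle:
        indices = gpus_from_pbs_gpufile(handle.read())
    # Restrict this process (and any children it starts) to those GPUs.
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in indices)
    print("CUDA_VISIBLE_DEVICES=" + os.environ["CUDA_VISIBLE_DEVICES"])


if __name__ == "__main__":
    main()
```

Note that setting os.environ only affects this Python process and programs
it launches, so the MD engine would have to be started from the same script
(or the printed value exported in the job's shell script instead).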

HTH,
Jason

-- 
Jason M. Swails
Quantum Theory Project,
University of Florida
Ph.D. Candidate
352-392-4032
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed May 29 2013 - 07:30:04 PDT