Re: [AMBER] Chained AMBER jobs crash on Dual GPU compute node.

From: ET <sketchfoot.gmail.com>
Date: Wed, 29 May 2013 17:39:56 +0100

Hi Jason,

Thanks very much for the information! At present it is only me running the
jobs, but I'm planning on getting TORQUE installed on some of the
department machines. In that situation, from what has been said, it seems
wise not to set PBS_GPUFILE myself and to let the system allocate
resources accordingly. This should be OK, as all the GPUs are identical
and thus I don't see a reason why anyone would prefer one over another.
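For what it's worth, here is a minimal sketch of the kind of job-script fragment that would respect the Torque allocation rather than pick devices by hand. It assumes the file $PBS_GPUFILE points to contains one entry per assigned GPU in a "hostname-gpuN" style; the pmemd.cuda line is purely illustrative:

```shell
#!/bin/bash
# Sketch of a Torque job script that honours $PBS_GPUFILE.
# Assumption: each line of the file looks like "node01-gpu0".

if [ -n "$PBS_GPUFILE" ] && [ -r "$PBS_GPUFILE" ]; then
    # Strip everything up to "gpu" to get the device index,
    # then join the indices with commas (e.g. "0,1").
    gpus=$(sed 's/.*gpu//' "$PBS_GPUFILE" | paste -sd, -)
    export CUDA_VISIBLE_DEVICES="$gpus"
fi

echo "Running on GPU(s): ${CUDA_VISIBLE_DEVICES:-all (no PBS_GPUFILE set)}"
# pmemd.cuda -O -i md.in -o md.out -p prmtop -c inpcrd   # launch Amber here
```

With CUDA_VISIBLE_DEVICES exported, the CUDA runtime only sees the devices Torque granted, so the Amber run cannot wander onto a GPU that belongs to someone else's job.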

br,
g


On 29 May 2013 15:15, Jason Swails <jason.swails.gmail.com> wrote:

> On Mon, May 27, 2013 at 12:04 PM, Jan-Philip Gehrcke <
> jgehrcke.googlemail.com> wrote:
>
> > Hello!
> >
> > On 27.05.2013 12:41, ET wrote:
> > > Thanks for your quick reply Jan-Phillip,
> > >
> > > Regarding your comments. I was aware of the $PBS_GPUFILE and had set
> the
> > > variable in my .bashrc. However, I could never find the text file in
> the
> > > location I had set.
> >
> > Setting this environment variable in any of your scripts does not make
> > sense. Torque sets this variable when it allocates a GPU job to a
> > specific node. The job program is then required to 1) read the
> > environment variable and 2) read the file it points to in order to
> > find out which GPU device it should use. Please make sure that you
> > understand what I am saying here :-).
> >
> > > Pardon my ignorance, but is this a necessity to run
> > > the jobs successfully with TORQUE?
> >
> > Generally, yes (at least that's the way I understood things -- correct
> > me if I am wrong). However, there are scenarios when things seem to work
> > "by accident", e.g. when you have only one GPU per node.
> >
>
> I agree that, out of courtesy to other users, you should use PBS_GPUFILE if
> multiple jobs can be dispatched to the same node and some nodes have
> multiple GPUs available. If you use every GPU on the node (and you request
> a full node via the PBS resource list), then there is no general need to
> lock specific threads to specific GPUs.
>
> However, if you have a small local cluster that your group is _just_
> running Amber on, either _everyone_ should respect PBS_GPUFILE or nobody
> should. If nobody respects it, the Amber jobs will run where they will,
> and Torque will make sure the GPUs are not over-subscribed in general.
> If everyone respects it, the jobs will run where Torque specified they
> can run, and everybody's happy. If some respect it and some don't, then
> the GPUs the system picks for certain jobs may clash with the GPUs
> Torque assigned to others, inevitably leading to conflicts.
>
> (The above analysis comes from experience with such a cluster)
>
> HTH,
> Jason
>
> --
> Jason M. Swails
> Quantum Theory Project,
> University of Florida
> Ph.D. Candidate
> 352-392-4032
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed May 29 2013 - 10:00:02 PDT