Re: [AMBER] GPU job scheduling: looking for recommendations and experiences

From: Jonathan Gough <jonathan.d.gough.gmail.com>
Date: Mon, 18 Mar 2013 12:10:16 -0400

Hi Jan-Philip,

I don't have a working solution at the moment, but I am trying to figure
this out myself. I have a small cluster running ROCKS with the version of
Torque distributed with it. However, much to my chagrin, I have found that
it is not configured for NVIDIA GPUs by default. I have had great
experience with Torque for scheduling CPU jobs.

I am contemplating manually reinstalling Torque and configuring it for
NVIDIA. The other option I am weighing is SGE; the version of SGE
distributed with ROCKS supposedly works. A kind gentleman, Gowtham at
Michigan Tech, wrote up a description of how to set up SGE to use GPUs.
He also detailed some of the issues in deploying NVIDIA drivers via ROCKS
on the ROCKS discussion board (I found it very helpful).

http://sgowtham.net/ramblings/

Here are the basic directions for deploying NVIDIA drivers via ROCKS:
https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2013-March/061813.html

I believe that going forward there may be compatibility issues with Torque
(I could be wrong). The issue is that Torque is often deployed with Maui,
which is the free cluster scheduler but is outdated (the last patch was in
2003). I'm looking into this now and can update you as I go forward.

Hope that helps,
Jonathan



On Mon, Mar 18, 2013 at 11:04 AM, Ross Walker <ross.rosswalker.co.uk> wrote:

> Hi Jan-Philip,
>
> Both Torque and Slurm are known to work. I have set up Slurm allowing
> shared use of nodes, which lets you hand out GPUs on an individual basis
> rather than as a whole node. Slurm's documentation is pretty lacking,
> though: there's lots of it, but it tends to focus on stupidly
> complicated and arcane examples, so it can be a pain to figure out.
>
> Torque I have not used, but others have reported success with it. There is
> a minor issue in that the Maui scheduler doesn't properly understand GPUs
> even though Torque does, so you can't run with Maui unless you are happy
> allocating at node rather than GPU granularity.
>
> Hope that helps. Others can hopefully give you more detailed info.
>
> All the best
> Ross
>
>
>
>
>
> On 3/18/13 6:35 AM, "Jan-Philip Gehrcke" <jgehrcke.googlemail.com> wrote:
>
> >Hello,
> >
> >I am aiming to set up a free GPU job scheduling solution in order to
> >distribute (Amber) GPU computing jobs among various nodes containing
> >CUDA devices. However, the corresponding resources on the web are still
> >scarce. When searching the web for the term "GPU job scheduling", this
> >short list hosted by NVIDIA is the most informative result:
> >https://developer.nvidia.com/job-scheduling
> >
> >I am currently looking into setting up Torque which "supports" NVIDIA
> >GPUs:
> >
> >http://docs.adaptivecomputing.com/torque/4-0-2/Content/topics/3-nodes/NVIDIAGPGPUs.htm
> >
> >However, before proceeding, I would be very interested to hear about the
> >experiences others have had.
> >
> >So, if you have set up a GPU cluster with proper job management based on
> >free software, then I would be happy to read about the scheduler
> >software of your choice, complications you ran into, and other
> >experiences you find worth mentioning. I am sure that this would help
> >not only me!
> >
> >
> >Thanks a lot,
> >
> >Jan-Philip
> >
> >
> >
> >_______________________________________________
> >AMBER mailing list
> >AMBER.ambermd.org
> >http://lists.ambermd.org/mailman/listinfo/amber
>
>
>
Received on Mon Mar 18 2013 - 09:30:02 PDT