Re: [AMBER] GPU job scheduling: looking for recommendations and experiences

From: Jan-Philip Gehrcke <jgehrcke.googlemail.com>
Date: Fri, 05 Apr 2013 16:35:41 +0200

Thanks, Ross and Jonathan, for your helpful answers. As a follow-up, I
can now also summarize my experiences:

I have set up a small cluster based on PBS/Torque 4.1. For a simple
scenario such as ours (a few highly similar graphics cards distributed
among a few machines), it works pretty well.

On the executing node, pbs_mom sets an environment variable pointing to
a file that tells the job which GPU it *should* use. For our group, I
have written a job submission program and a small job-wrapping program
that hide all GPU-related details from the user: CUDA_VISIBLE_DEVICES is
set properly before the user-given job command is executed. These small
(Python) programs are available at
https://bitbucket.org/jgehrcke/torque-gpu-compute-jobs/src .
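
For illustration, the core of the wrapping program boils down to
something like the following (a minimal sketch, not the actual code from
the repository; it assumes Torque's PBS_GPUFILE variable and its
"hostname-gpuN" file format, one line per assigned GPU):

    import os
    import subprocess
    import sys

    def assigned_gpu_ids():
        # PBS_GPUFILE points to a file with one line per assigned GPU,
        # e.g. "node01-gpu0"; keep only the index after the "-gpu" part.
        with open(os.environ["PBS_GPUFILE"]) as f:
            return [line.strip().rsplit("-gpu", 1)[1]
                    for line in f if line.strip()]

    # Restrict the job to its assigned device(s), then run the user-given
    # job command (passed as the wrapper's command line arguments).
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(assigned_gpu_ids())
    sys.exit(subprocess.call(sys.argv[1:]))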

As Ross pointed out, Maui, the advanced scheduler for PBS/Torque, does
not yet support GPU resources. However, there have recently been
(fruitless) efforts to implement the corresponding features:
http://www.clusterresources.com/pipermail/mauiusers/2012-March/004897.html
Maybe someone will finish this work properly in the future.

In the case of a homogeneous GPU cluster, the minimalistic scheduler
pbs_sched works just fine. At least for us, the basic queuing mechanism
is sufficient. The only disadvantage is that if you have more GPGPUs
than simultaneously running jobs, certain graphics cards are almost
never loaded while others always are.
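
To make the submission side concrete: with pbs_sched, handing a job to
the queue reduces to a plain qsub call that requests a GPU via Torque's
'gpus' node resource. Roughly (a hypothetical sketch, not the actual
submission program; the queue name and the wrapper file name are made
up, and the exact resource syntax may differ between Torque versions):

    import subprocess
    import sys

    def submit(user_command, queue="gpu"):
        # Wrap the user command in a tiny job script and hand it to qsub,
        # requesting one GPU on one node via the 'gpus' resource.
        script = ("#!/bin/sh\n"
                  "python gpu_job_wrapper.py %s\n" % " ".join(user_command))
        return subprocess.run(
            ["qsub", "-q", queue, "-l", "nodes=1:ppn=1:gpus=1"],
            input=script, text=True).returncode

    if __name__ == "__main__":
        sys.exit(submit(sys.argv[1:]))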

Cheers,

Jan-Philip


On 03/18/2013 05:10 PM, Jonathan Gough wrote:
> Hi Jan-Philip,
>
> I don't have a working solution at the moment, but am actually trying to
> figure this out myself. I have a small cluster running ROCKS using the
> distributed version of Torque. However, much to my chagrin, I have found
> that it is not configured for GPUs/NVIDIA by default. I have had great
> experience with Torque for scheduling CPU jobs.
>
> I am contemplating manually re-installing Torque and configuring it for
> NVIDIA. The other option I am weighing is using SGE; the distributed
> version of SGE on ROCKS supposedly works. A kind gentleman, Gowtham at
> Michigan Tech, wrote up a description of how to set up SGE to utilize GPUs.
> He also detailed some of the issues in deploying NVIDIA drivers via ROCKS
> (I found it very helpful) on the ROCKS discussion board.
>
> http://sgowtham.net/ramblings/
>
> Here are the basic directions for deploying NVIDIA via ROCKS:
> https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2013-March/061813.html
>
> I believe that going forward there may be compatibility issues with Torque
> (I could be wrong). The issue is that Torque is often deployed with Maui,
> which is the free cluster scheduler but is outdated (2003 was the last
> patch). I'm looking into this now and can update you as I go forward.
>
> Hope that helps,
> Jonathan
>
>
>
> On Mon, Mar 18, 2013 at 11:04 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
>
>> Hi Jan-Philip,
>>
>> Both Torque and Slurm are known to work. I have set up Slurm allowing
>> shared use of nodes, which lets you hand out GPUs on an individual basis
>> rather than as a whole node. Slurm's documentation is pretty lacking,
>> though - there's lots of it, but it tends to focus on stupidly
>> complicated and arcane examples, so it can be a pain to figure out.
>>
>> Torque I have not used, but others have reported success with it - there is
>> a minor issue in that the Maui scheduler doesn't properly understand GPUs
>> even though Torque does, so you can't run with Maui unless you are happy
>> allocating at node rather than GPU granularity.
>>
>> Hope that helps. Others can hopefully give you more detailed info.
>>
>> All the best
>> Ross
>>
>>
>>
>>
>>
>> On 3/18/13 6:35 AM, "Jan-Philip Gehrcke" <jgehrcke.googlemail.com> wrote:
>>
>>> Hello,
>>>
>>> I am aiming to set up a free GPU job scheduling solution in order to
>>> distribute (Amber) GPU computing jobs among various nodes containing
>>> CUDA devices. However, the corresponding resources on the web are still
>>> scarce. When searching the web for the term "GPU job scheduling", the
>>> most informative result is this short list hosted by NVIDIA:
>>> https://developer.nvidia.com/job-scheduling
>>>
>>> I am currently looking into setting up Torque, which "supports" NVIDIA
>>> GPUs:
>>> http://docs.adaptivecomputing.com/torque/4-0-2/Content/topics/3-nodes/NVIDIAGPGPUs.htm
>>>
>>> However, before proceeding, I would be very interested to hear about the
>>> experiences others have had.
>>>
>>> So, if you have set up a GPU cluster with proper job management based on
>>> free software, I would be happy to read about the scheduler software of
>>> your choice, the complications you ran into, and other experiences you
>>> find worth mentioning. I am sure this would help not only me!
>>>
>>>
>>> Thanks a lot,
>>>
>>> Jan-Philip


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Apr 05 2013 - 08:00:02 PDT