Hi All,
Note that if you are running AMBER on a single node, either interactively or
through a queuing system that has allocated you the entire node, the optimal
approach is to make sure the GPUs are in compute-exclusive mode. The easiest
way to do this is to add the following to /etc/rc.d/rc.local on each node, or
to have it run by root as part of the queuing system prologue:
nvidia-smi -pm 1
nvidia-smi -c 3
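For reference, -pm 1 enables persistence mode (the driver stays loaded between
jobs) and -c 3 sets the compute mode to EXCLUSIVE_PROCESS, i.e. only one
compute process per GPU; with comments, the rc.local lines are just:

  nvidia-smi -pm 1   # enable persistence mode
  nvidia-smi -c 3    # compute mode 3 = EXCLUSIVE_PROCESS (one process per GPU)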
When in this mode you can just run as many AMBER jobs as you want and each
will automatically detect and use an available GPU. When all GPUs are in use,
a new job will quit with a message to that effect. So, say you have a node
with 4 GPUs and want to run 4 x 1-GPU jobs at a time that vary in their run
length: you can use GNU parallel to launch a large set of jobs and instruct it
to run 4 at a time - as a GPU becomes free, the next job will automatically
pick it up (see the sketch below). This also means the GPU ID reported in
mdout is correct.
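A minimal sketch with GNU parallel (the job list file name and the pmemd.cuda
arguments here are only placeholders):

  # joblist.txt holds one complete command line per job, e.g.
  #   $AMBERHOME/bin/pmemd.cuda -O -i md1.in -p prmtop -c md1.rst -o md1.out
  # run the whole list, at most 4 jobs (and hence 4 GPUs) at a time
  parallel -j 4 < joblist.txt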
This also works in parallel - if you run a multi-GPU job on a single node it
will automatically select available GPUs without overloading any of them.
Setting compute-exclusive mode also stops you from accidentally
oversubscribing a GPU.
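For example, a 2-GPU run on one node would be launched along these lines (a
sketch only, assuming an MPI build and placeholder file names):

  mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i md.in -p prmtop -c inpcrd -o md.out

With the GPUs in compute-exclusive mode the two ranks will end up on two
different free GPUs.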
If you still want to use CUDA_VISIBLE_DEVICES, or your queuing system sets it
for you, then I'm afraid the only way to tell which GPU AMBER ran on is to
echo $CUDA_VISIBLE_DEVICES before running AMBER and save the output in a log
file that you can refer to later (a sketch is given below). As far as AMBER is
concerned, it has no way of knowing the real physical ID of the GPU it was
handed by the NVIDIA driver as determined by CUDA_VISIBLE_DEVICES. We could
probably add code that tries to read the current environment variable setting
and print it within mdout, but in my experience such approaches tend to be
very unportable between operating systems - it probably wouldn't work on Cray,
for example.
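The logging approach boils down to something like this (the log file name is
arbitrary):

  echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" >> gpu_job.log
  $AMBERHOME/bin/pmemd.cuda -O -i md.in -p prmtop -c inpcrd -o md.out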
All the best
Ross
On 8/15/13 12:21 PM, "Jan-Philip Gehrcke" <jgehrcke.googlemail.com> wrote:
>Hello Henk,
>
>you are right: setting CUDA_VISIBLE_DEVICES is the task of the batch system,
>and printing CUDA_VISIBLE_DEVICES should be part of the job.
>
>I am of the opinion that people should wrap their scientific jobs in
>self-contained shell scripts anyway, for job documentation and
>reproducibility purposes. W.r.t. Amber, in such a script I normally keep
>to the following order:
>
>1) set up a certain Amber installation
>2) print some debugging info to stdout
>3) define and write the Amber input files
>4) run the number crunching part, printing the exact command used
>
>In part (2), I usually print the absolute path to certain executables,
>the hostname, as well as the value of certain environment variables,
>such as the jobid variables for various batch systems and also
>CUDA_VISIBLE_DEVICES if set.
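>
>For example, part (2) of such a wrapper might be as small as this (just a
>sketch; the job id variable, here PBS_JOBID, differs between batch systems):
>
>    echo "host:                 $(hostname)"
>    echo "pmemd.cuda:           $(which pmemd.cuda)"
>    echo "PBS_JOBID:            ${PBS_JOBID:-unset}"
>    echo "CUDA_VISIBLE_DEVICES: ${CUDA_VISIBLE_DEVICES:-unset}"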
>
>The information at hand for each job when applying this strategy has
>often helped me to identify issues in corners I would not have thought
>of otherwise.
>
>Cheers,
>
>Jan-Philip
>
>
>On 15.08.2013 20:47, Meij, Henk wrote:
>> The list archive has helped me puzzle out why Amber always thinks the
>> instance ID of any GPU is 0. However, it would be nice for users to know
>> the actual instance ID so they can monitor utilization % while their
>> programs are running.
>>
>>
>>
>> That can be done by setting CUDA_VISIBLE_DEVICES inside the wrapper and,
>> just before the program starts, echoing the actual instance ID to STDOUT.
>> I have done that with Lava/Amber12 and it is documented here:
>>
>>
>>
>> https://dokuwiki.wesleyan.edu/doku.php?id=cluster:119
>>
>>
>>
>> At the start of the STDOUT report users may now observe nodeName:gpuID,
>> for example:
>>
>>   GPU allocation instance n36:2
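>>
>> The relevant wrapper lines boil down to something like this (a sketch only;
>> GPU_ID stands for whatever device number the scheduler assigns - the full
>> Lava logic is on the wiki page above):
>>
>>   export CUDA_VISIBLE_DEVICES=$GPU_ID
>>   echo "GPU allocation instance $(hostname -s):$CUDA_VISIBLE_DEVICES"
>>   $AMBERHOME/bin/pmemd.cuda -O -i md.in -p prmtop -c inpcrd -o md.out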
>>
>> -Henk
>
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Aug 15 2013 - 14:00:03 PDT