Re: [AMBER] Protocol for multiple CPU + single GPU run on a single node

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 14 May 2014 10:39:22 -0700

Not strictly true.

pmemd.cuda.MPI is there to facilitate multi-GPU runs either on different
nodes (not recommended) or within the same node.

E.g. suppose you have a system with 2 GPUs in it. You could either run
two independent single-GPU jobs:

cd run1
export CUDA_VISIBLE_DEVICES=0
nohup $AMBERHOME/bin/pmemd.cuda -O -i ... &
cd ../run2
export CUDA_VISIBLE_DEVICES=1
nohup $AMBERHOME/bin/pmemd.cuda -O -i ... &

And BOTH calculations will run at full speed (using a total of 2 of your
CPU cores). This is different from a lot of other codes, which see
contention here because they also use the CPU cores and so rely on PCI-E
communication on every step.

Or you could run each job on both GPUs, one after the other:

cd run1
export CUDA_VISIBLE_DEVICES=0,1
mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i ...
cd ../run2
mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i ...

In total this will take longer than the two single-GPU runs above, since
scaling a single run to multiple GPUs is far from linear, BUT if you want
run1 completed as quickly as possible this is the way to do it.


Note that if you are using AMBER 14, your two GPUs can talk to each other
via peer to peer (they should be able to if they sit on the same IOH
controller / physical CPU socket), and you have true PCI-E gen 3 x16
bandwidth to each, then you should see very good multi-GPU performance.
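
You can check what link each GPU is actually getting with nvidia-smi (on
reasonably recent drivers), e.g.:

nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current --format=csv

Each card should report gen 3 and a x16 width - note the reported
generation can drop while a card is idle (power saving), so check this
while a job is running.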

If you have 4 GPUs (you'd need a two-socket system right now for this to
be full bandwidth) then with AMBER 14 you could run 2 x 2 GPU runs at the
same time, one using GPUs 0 and 1 and one using GPUs 2 and 3, assuming
that split matches how they talk to each other over peer to peer. You
could also run 4 x 1 GPU, or 2 x 1 GPU plus 1 x 2 GPU. Currently no
production motherboard supports 4-way peer to peer, but when one does the
code should scale well to 4 GPUs.
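
For the 2 x 2 GPU case that would look something like this (assuming GPUs
0/1 and 2/3 are the pairs that can talk peer to peer):

cd run1
export CUDA_VISIBLE_DEVICES=0,1
nohup mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i ... &
cd ../run2
export CUDA_VISIBLE_DEVICES=2,3
nohup mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i ... &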

Multi-node runs are a bad idea with GPUs right now for anything other
than REMD and other loosely coupled approaches, because interconnect
bandwidth has sadly not kept up with GPU improvements, so modern GPUs
(K40, GTX-Titan-Black etc.) are simply too fast for the interconnect.

For now what is on http://ambermd.org/gpus/ for running in parallel
applies to AMBER 12 (even though it is on the AMBER 14 page) - I have not
had a chance to update it yet. I am just finalizing a short piece of code
that will test which GPUs can communicate via peer to peer in a node so
one knows what to set CUDA_VISIBLE_DEVICES to and then I'll update that
section.
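
In the meantime, something along the lines of the following rough sketch
(just an illustration of the idea, not the final script) will report
which device pairs can use peer to peer - it compiles a few lines of CUDA
and loops over cudaDeviceCanAccessPeer for every pair of GPUs:

cat > p2p_check.cu << 'EOF'
// Minimal peer-to-peer capability check: for every ordered pair of
// GPUs, ask the CUDA runtime whether device i can access device j.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("Found %d CUDA device(s)\n", n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, i, j);
            printf("GPU %d -> GPU %d : peer to peer %s\n",
                   i, j, ok ? "YES" : "NO");
        }
    }
    return 0;
}
EOF
nvcc p2p_check.cu -o p2p_check
./p2p_check

Pairs that report YES in both directions are the ones worth grouping
together in CUDA_VISIBLE_DEVICES for a 2 GPU pmemd.cuda.MPI run.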

In terms of performance, see http://ambermd.org/gpus/benchmarks.htm for
updated numbers with AMBER 14. In my experience, if you run like-for-like
simulations against Gromacs (that is, NOT doing crazy things like only
updating the pair list every 20 steps and other such hacks) I think you
will find that AMBER on a single GPU beats Gromacs on two GPUs - and if
you add in the cumulative performance of running two single-GPU jobs, one
on each GPU, then it wins hands down. For raw throughput on a single job
using two GPUs, AMBER 14 should, from the testing I have done trying to
run identical calculations, be faster than any other MD code right now on
the same hardware.

And you still get your remaining CPU cores free to run some QM/MM or other
such calculation on. Bonus! ;-)

Hope that helps. Sorry the instructions on the website are not current - I
am trying to get them updated as quickly as possible.

All the best
Ross


On 5/14/14, 10:13 AM, "MURAT OZTURK" <murozturk.ku.edu.tr> wrote:

>To clarify, pmemd.cuda.MPI is only there to facilitate multi GPU runs when
>GPUs are on different nodes then?
>
>This is very different than gromacs where I can do multi cpu + multi gpu.
>I
>wonder how the performance will compare.
>
>
>On Wed, May 14, 2014 at 6:57 PM, Ross Walker <ross.rosswalker.co.uk>
>wrote:
>
>> To add to Jason's answer - you can of course use the remaining 19 CPUs
>> (make sure there are really 20 cores in your machine and not 10 cores +
>>10
>> hyperthreads) for something else while the GPU run is running.
>>
>> cd GPU_run
>> nohup $AMBERHOME/bin/pmemd.cuda -O -i ... &
>> cd ../CPU_run
>> nohup mpirun -np 19 $AMBERHOME/bin/pmemd.MPI -O -i ... &
>>
>> All the best
>> Ross
>>
>>
>> On 5/14/14, 8:17 AM, "Jason Swails" <jason.swails.gmail.com> wrote:
>>
>> >On Wed, 2014-05-14 at 17:49 +0300, MURAT OZTURK wrote:
>> >> I will be running on a single node with 20 cpus and 1 gpu installed.
>> >>
>> >> Do I have to use pmemd.cuda.MPI for this, or is pmemd.cuda enough..?
>> >>
>> >> How do I specify the number of cpus used with pmemd.cuda? I can't
>>seem
>> >>to
>> >> find this information in the manual.
>> >
>> >Just pmemd.cuda. The thing about pmemd.cuda is that it runs the
>> >_entire_ calculation on the GPU, so adding CPUs buys you nothing.
>> >
>> >The way it is designed, each CPU thread will launch a GPU thread as
>>well
>> >(so you are stuck using 1 CPU for each GPU).
>> >
>> >HTH,
>> >Jason
>> >
>> >--
>> >Jason M. Swails
>> >BioMaPS,
>> >Rutgers University
>> >Postdoctoral Researcher
>> >
>> >
>> >_______________________________________________
>> >AMBER mailing list
>> >AMBER.ambermd.org
>> >http://lists.ambermd.org/mailman/listinfo/amber
>>
>>
>>
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber
>>
>_______________________________________________
>AMBER mailing list
>AMBER.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed May 14 2014 - 11:00:02 PDT