Re: [AMBER] Protocol for multiple CPU+ single GPU run on a single node,

From: MURAT OZTURK <murozturk.ku.edu.tr>
Date: Wed, 14 May 2014 21:35:26 +0300

Oh so very exciting. I never quite grasped how fast the GPUs are and how
slow the bus is, apparently.

Since we are already slightly off-topic, I dare to ask whether anybody has
any experience with ACEMD? They claim to have a state-of-the-art pure-GPU
implementation. Do we know whether it is fundamentally different from the
AMBER implementation?

Regards,

Murat


On Wed, May 14, 2014 at 9:10 PM, Scott Le Grand <varelse2005.gmail.com> wrote:

> I am occasionally tempted to write a multithreaded SSE/AVX version of PME
> for crazy people, but then I wake up from the nightmare and I do something
> useful instead.
>
>
>
>
> On Wed, May 14, 2014 at 10:39 AM, Ross Walker <ross.rosswalker.co.uk>
> wrote:
>
> > Not strictly true.
> >
> > pmemd.cuda.MPI is there to facilitate multi-GPU runs either on different
> > nodes (not recommended) or within the same node.
> >
> > E.g. suppose you have a system with 2 GPUs in it. You could do either:
> >
> > cd run1
> > export CUDA_VISIBLE_DEVICES=0
> > nohup $AMBERHOME/bin/pmemd.cuda -O -i ... &
> > cd ../run2
> > export CUDA_VISIBLE_DEVICES=1
> > nohup $AMBERHOME/bin/pmemd.cuda -O -i ... &
> >
> > And BOTH calculations will run at full speed (using a total of 2 of your
> > CPU cores). This is different from a lot of other codes, which have
> > contention here because they also use the CPU cores and therefore rely on
> > PCI-E communication on every step.
> >
> > Or you could do:
> >
> > cd run1
> > export CUDA_VISIBLE_DEVICES=0,1
> > mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i ...
> > cd ../run2
> > mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i ...
> >
> > In aggregate this will take longer than running two single-GPU jobs
> > concurrently, since scaling to multiple GPUs for a single run is far from
> > linear, BUT if you want to get run1 completed as quickly as possible this
> > works.
> >
> >
> > Note that if you are using AMBER 14, your two GPUs can talk to each other
> > via peer to peer (they should be able to if they are on the same IOH
> > controller / physical CPU socket), and you have true PCI-E gen 3 x16
> > bandwidth to each, then you should see very good multi-GPU performance.
> >
> > If you have 4 GPUs (you'd need a two-socket system right now for this to
> > be full bandwidth) then you could run 2 x 2 GPU runs at the same time with
> > AMBER 14, one using GPUs 0 and 1 and one using GPUs 2 and 3, assuming this
> > matches how they talk to each other over peer to peer. Or 4 x 1 GPU, or
> > 2 x 1 GPU and 1 x 2 GPU. Currently no production motherboard supports 4-way
> > peer to peer yet, but when they do the code should scale well to 4 GPUs.
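> >
> > For illustration, a minimal sketch of the 2 x 2 GPU layout (assuming GPUs
> > 0/1 and 2/3 are the peer-to-peer pairs on your board - check this for your
> > own hardware):
> >
> > # job A on the first peer-to-peer pair
> > cd run1
> > export CUDA_VISIBLE_DEVICES=0,1
> > nohup mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i ... &
> >
> > # job B on the second peer-to-peer pair
> > cd ../run2
> > export CUDA_VISIBLE_DEVICES=2,3
> > nohup mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i ... &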
> >
> > Multi-node is a bad idea for things other than REMD and other loosely
> > coupled stuff with GPUs right now, because interconnect bandwidth has sadly
> > not kept up with GPU improvements, so modern GPUs (K40, GTX-Titan-Black,
> > etc.) are too fast for the interconnect.
> >
> > For now what is on http://ambermd.org/gpus/ for running in parallel
> > applies to AMBER 12 (even though it is on the AMBER 14 page) - I have not
> > had a chance to update it yet. I am just finalizing a short piece of code
> > that will test which GPUs can communicate via peer to peer in a node so
> > one knows what to set CUDA_VISIBLE_DEVICES to and then I'll update that
> > section.
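> >
> > For what it is worth, the core of such a check is just the CUDA runtime
> > call cudaDeviceCanAccessPeer. A minimal standalone sketch (illustrative
> > only, not the actual utility; file and program names are placeholders)
> > would be something like:
> >
> > // p2p_check.cu - list which GPU pairs can enable peer-to-peer access.
> > // Illustrative sketch only; not part of the AMBER distribution.
> > #include <cstdio>
> > #include <cuda_runtime.h>
> >
> > int main() {
> >     int n = 0;
> >     if (cudaGetDeviceCount(&n) != cudaSuccess || n < 2) {
> >         printf("Fewer than two CUDA devices visible.\n");
> >         return 0;
> >     }
> >     // Check every ordered pair of visible devices.
> >     for (int i = 0; i < n; ++i) {
> >         for (int j = 0; j < n; ++j) {
> >             if (i == j) continue;
> >             int ok = 0;
> >             cudaDeviceCanAccessPeer(&ok, i, j);
> >             printf("GPU %d -> GPU %d : peer to peer %s\n",
> >                    i, j, ok ? "YES" : "NO");
> >         }
> >     }
> >     return 0;
> > }
> >
> > Compile with nvcc (e.g. nvcc -o p2p_check p2p_check.cu) and group the GPUs
> > that report YES in both directions together in CUDA_VISIBLE_DEVICES.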
> >
> > In terms of performance, see http://ambermd.org/gpus/benchmarks.htm for
> > updated numbers with AMBER 14. From my experience, if you run like-for-like
> > simulations with Gromacs (that is, NOT doing crazy things like only
> > updating the pair list every 20 steps and other such hacks) then I think
> > you will find that AMBER on a single GPU beats Gromacs on two GPUs - and
> > if you add the cumulative performance of running two single-GPU jobs, one
> > on each GPU, then it wins hands down. For raw throughput on a single job
> > using two GPUs, AMBER 14 should be faster than any other MD code right now
> > on the same hardware, from the testing I have done trying to run identical
> > calculations.
> >
> > And you still get your remaining CPU cores free to run some QM/MM or other
> > such calculation on. Bonus! ;-)
> >
> > Hope that helps. Sorry the instructions on the website are not current - I
> > am trying to get it done as quickly as possible.
> >
> > All the best
> > Ross
> >
> >
> > On 5/14/14, 10:13 AM, "MURAT OZTURK" <murozturk.ku.edu.tr> wrote:
> >
> > >To clarify, pmemd.cuda.MPI is only there to facilitate multi-GPU runs when
> > >GPUs are on different nodes, then?
> > >
> > >This is very different from Gromacs, where I can do multi-CPU + multi-GPU.
> > >I wonder how the performance will compare.
> > >
> > >
> > >On Wed, May 14, 2014 at 6:57 PM, Ross Walker <ross.rosswalker.co.uk>
> > >wrote:
> > >
> > >> To add to Jason's answer - you can of course use the remaining 19 CPUs
> > >> (make sure there are really 20 cores in your machine and not 10 cores +
> > >> 10 hyperthreads) for something else while the GPU run is running.
> > >>
> > >> cd GPU_run
> > >> nohup $AMBERHOME/bin/pmemd.cuda -O -i ... &
> > >> cd ../CPU_run
> > >> nohup mpirun -np 19 $AMBERHOME/bin/pmemd.MPI -O -i ... &
> > >>
> > >> All the best
> > >> Ross
> > >>
> > >>
> > >> On 5/14/14, 8:17 AM, "Jason Swails" <jason.swails.gmail.com> wrote:
> > >>
> > >> >On Wed, 2014-05-14 at 17:49 +0300, MURAT OZTURK wrote:
> > >> >> I will be running on a single node with 20 CPUs and 1 GPU installed.
> > >> >>
> > >> >> Do I have to use pmemd.cuda.MPI for this, or is pmemd.cuda enough?
> > >> >>
> > >> >> How do I specify the number of CPUs used with pmemd.cuda? I can't seem
> > >> >> to find this information in the manual.
> > >> >
> > >> >Just pmemd.cuda. The thing about pmemd.cuda is that it runs the
> > >> >_entire_ calculation on the GPU, so adding CPUs buys you nothing.
> > >> >
> > >> >The way it is designed, each CPU thread will launch a GPU thread as well
> > >> >(so you are stuck using 1 CPU for each GPU).
> > >> >
> > >> >HTH,
> > >> >Jason
> > >> >
> > >> >--
> > >> >Jason M. Swails
> > >> >BioMaPS,
> > >> >Rutgers University
> > >> >Postdoctoral Researcher
> > >> >
> > >> >
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber