Re: [AMBER] README under pmemd/src?

From: Jason Swails <>
Date: Wed, 20 Nov 2013 13:14:21 -0500

On Wed, 2013-11-20 at 09:21 -0800, yunshi11 . wrote:
> Hi Ross,
> On Tue, Nov 19, 2013 at 10:35 AM, Ross Walker <> wrote:
> > Hi Yun,
> >
> >
> > What do you mean by 48 CPUs + 12 GPUs? - Do you mean you are trying to run
> > pmemd.cuda.MPI across 48 cores connected to 12 GPUs? - For starters things
> > like block_fft only apply to CPU runs - they are meaningless in GPU runs.
> > Secondly I would suggest reading the following page:
> > which will explain how to run AMBER GPU runs.
> > Essentially 48 CPUs + 12 GPUs does not make sense and even if this was 12
> > Cores + 12 GPUs the calculation would be unlikely to scale unless it was a
> > replica exchange run.
> >
> >
> Our cluster has some 12-core nodes (2 x 6-core Intel E5649) that have 3
> general-purpose GPUs (NVIDIA Tesla M2070s) each, which results in this 4:1
> CPU(core):GPU ratio.
> Reading through the link, it seems to me that a 1:1 ratio would be better?
> When running
> *pmemd.cuda.MPI*, the number of tasks/threads depends on the number of CPUs?
> And it is better to assign only one task to each GPU?
> But why do 12 cores + 12 GPUs NOT scale? Because 12 is not a power of 2?

It's because the code is simply not that scalable. The communication
time is long compared to the calculation time, so you spend too much
time waiting for data exchange. This, coupled with the fact that the
CUDA FFT implementation runs on only one GPU, puts an upper limit on
the scalability of explicit-solvent simulations on the GPU. If the FFT
takes 1/8 of the time, the best you can hope for is to scale to 8 GPUs
(with one spending all of its time on the FFT).

While the next version will have better scaling than Amber 12, it will
still be somewhat limited.

> I also noticed that the "AMBER Certified Mid-Level Workstation" has 2x
> (6-core) Intel Xeon E5-2620 with 2x NVIDIA GTX 780 GPUs, which would make
> CPU(core):GPU ratio at 6:1?

Every thread in pmemd.cuda.MPI runs on a GPU. The CPU is only used for
mundane tasks, like file input/output and shouting 'directions' at the
GPU -- the GPU does _all_ of the work. You typically want more CPUs
than GPUs because you don't want to oversubscribe the CPUs. However, if
you launch more MPI threads than you have GPUs for pmemd.cuda.MPI, it
will significantly slow down the calculation (or kill it completely) by
trying to run multiple threads on 1 GPU.
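In practice that means matching the MPI rank count to the GPU count. A
hedged sketch of a typical launch (the input/output file names are
placeholders, and your cluster's scheduler and MPI stack will dictate
the exact mpirun invocation):

```shell
# Run on exactly 2 GPUs: expose 2 devices, start 2 MPI ranks.
# Launching more ranks than visible GPUs would stack threads on one GPU.
export CUDA_VISIBLE_DEVICES=0,1
mpirun -np 2 pmemd.cuda.MPI -O -i md.in -p prmtop -c inpcrd -o md.out
```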

Just because you have 12 CPUs doesn't mean you have to use them all on
the same job ;).


Jason M. Swails
Rutgers University
Postdoctoral Researcher
AMBER mailing list
Received on Wed Nov 20 2013 - 10:30:02 PST