Re: [AMBER] Amber 12 on K20 GPUs -- how am I doing?

From: Ross Walker <ross.rosswalker.co.uk>
Date: Tue, 16 Dec 2014 10:15:36 -0800

Hi David,

Most of the 'latest technology' you refer to here is marketing fluff. It
tends to be unreliable, a pain to use, does not work with ALL GPUs (e.g.
GeForce cards), requires specific MPI versions, only works with certain
interconnects, etc. The performance gain tends to be so small that it
really isn't worth the hassle, which is why we don't use all these bells
and whistles. With AMBER 14 we simply gave up on multi-node runs for
anything other than REMD and the like, since interconnects are expensive
and these days just not fast enough to keep up with the GPUs. Instead we
moved to peer-to-peer communication within a node, which doesn't need any
fancy CUDA versions, lets you buy very cost-effective multi-GPU nodes,
saves money by avoiding expensive interconnects, and supports 2-way runs
efficiently on almost all motherboards (and 4-way runs on some newer
hardware).

With regard to AMBER 12, you won't be able to get much out of multi-GPU
runs, and it really isn't worth the time optimizing the MPI etc. since the
payoff will be small. I'd suggest sticking with regular GCC/gfortran (the
Intel compilers won't make any difference for GPU runs) and sticking to
single-GPU runs. Our simulations are always short on sampling, so you are
almost always better off running multiple independent runs rather than
trying to make a single run go a little bit faster. With AMBER 14 the
multi-GPU scaling within a node is better, so there, if you have say 4
GPUs in a box, it's probably best to run 2 x 2-GPU runs (or 4 x 1-GPU
runs) -- see the sketch below.
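
As a rough sketch of what I mean by independent runs (the input and output
file names and GPU IDs here are just placeholders), on a node with two
K20s you can pin one single-GPU job to each card via CUDA_VISIBLE_DEVICES:

  # first independent copy, pinned to GPU 0
  CUDA_VISIBLE_DEVICES=0 nohup $AMBERHOME/bin/pmemd.cuda -O -i md.in \
      -p prmtop -c run1.rst -o run1.out -r run1_new.rst -x run1.nc &

  # second independent copy, pinned to GPU 1
  CUDA_VISIBLE_DEVICES=1 nohup $AMBERHOME/bin/pmemd.cuda -O -i md.in \
      -p prmtop -c run2.rst -o run2.out -r run2_new.rst -x run2.nc &

Each process then only sees its own card, so the two runs don't fight over
the same GPU.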

In terms of MVAPICH2-GDR etc. -- we don't use any of it. The
recommendation for CUDA 5.0 is actually because later versions of CUDA are
slower, mostly because all this extra marketing fluff was added. At the
end of the day, in AMBER 14 we just use PCI-E peer-to-peer copies -- super
simple, super fast, and hardware agnostic, assuming the GPUs are connected
to the same PCI-E root complex. The simple solutions are always the best.
Now if only someone would properly implement PCI-E broadcast as specified
in the PCI-E 3.0 spec, we would be golden!
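
If you want to check whether a given pair of cards can actually talk peer
to peer, one quick way (assuming you have the CUDA toolkit samples
installed -- the exact path varies between installs) is the simpleP2P
sample, something like:

  # build and run NVIDIA's simpleP2P sample to test peer access
  cd $CUDA_HOME/samples/0_Simple/simpleP2P
  make
  ./simpleP2P

It reports whether peer access is supported between the devices and gives
a rough peer-to-peer bandwidth number.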

For multi-GPU runs I still use plain old basic MPICH 1.4, which must be
about 7 years old by now.
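
For what it's worth, a within-node 2-GPU run needs nothing more exotic
than a stock mpirun (again, the file names below are just placeholders):

  # expose both K20s to the job, then launch pmemd's MPI/CUDA executable
  export CUDA_VISIBLE_DEVICES=0,1
  mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i md.in -p prmtop \
      -c md.rst -o md.out -r md_new.rst -x md.nc

With AMBER 14, pmemd.cuda.MPI will use the peer-to-peer copies
automatically when the two GPUs sit behind the same PCI-E root complex.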

All the best
Ross


On 12/16/14, 4:22 AM, "Baker D.J." <D.J.Baker.soton.ac.uk> wrote:

>Hello,
>
>At the moment I am doing some benchmarks with Amber 12 on our new K20 GPU
>cards. It is a pity that I do not have access to the latest Amber (v14)
>-- I am waiting for the site license and that is out of my hands. Amber
>12 does not work with CUDA 6.x; however, it does work with CUDA 5.x, and
>v5.0 is actually the version recommended by the developers. So, in other
>words, this Amber cannot take advantage of all the latest technology --
>GPUDirect and all that. I'm just trying to get a feel for how I'm doing;
>in other words, do the benchmark figures quoted in this mail make sense?
>
>To do these benchmarks I decided to download a copy of the MVAPICH2
>source, and build it from scratch. We have OFED 2.1 and access to CUDA
>5.0, and as a start I decided to go for a fairly simple MVAPICH configure:
>
>./configure --with-cuda=/local/software/cuda/5.0.35
>--prefix=/local/software/mvapich2/2.0.1/intel-cuda5.0
>
>Once MVAPICH2 was built I then installed Amber 12, using the Intel
>compilers and MKL libraries, with CUDA/MPI support. The results were OK,
>but not really impressive. I had earlier built the same Amber executable
>with OpenMPI (with CUDA support). We have two K20 GPU cards per node.
>Here are the results for one of the Amber benchmarks
>(PME/Cellulose_production_NPT):
>
>             Time(s)   ns/day
>CUDA x1        393      4.47
>MVAPICH:
>  CUDA x2      323      5.47
>  CUDA x4      286      6.27
>OpenMPI:
>  CUDA x2      324      5.46
>  CUDA x4      254      7.15
>
>It is interesting to see that OpenMPI performs pretty well these
>days. When I last tried this comparison, at least a couple of years ago, I
>found that OpenMPI performed terribly. It seems that the OpenMPI team
>have put a lot of work into getting it working with GPU cards. Am I,
>however, correct in thinking that MVAPICH could still do better, and if
>so how would you recommend that I configure it for OFED/GPU support,
>please?
>
>On the other hand I am really keen to run some Amber 14 benchmarks and try
>out CUDA 6.x with MVAPICH2-GDR. It is a real pity that I am still waiting
>for a colleague in the chemistry department to sort out our site license.
>Does anyone, by any chance, have any results for Amber 14 plus all the
>latest GPU/CUDA technology, please?
>
>Best regards -- David.
>_______________________________________________
>AMBER mailing list
>AMBER.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Dec 16 2014 - 10:30:02 PST