Re: [AMBER] Running amber v11 over multiple gpus/nodes

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 14 Sep 2011 10:55:03 -0700

Hi Peter,

> Thank you. That really clears things up for me. The technology document
> is particularly good and sets out things (re CUDA v4) really well. The
> parallel speed up of this benchmark over 4 gpus isn't that great (about
> 8 minutes to run the simulation vs 11.5 minutes on 2 gpus), however I
> suspect that it is about as good as it gets at the moment. On the other
> hand, looking at the bigger picture, this is pretty good.

Note that the GPU Direct support in AMBER right now is GPU Direct v1, i.e.
the use of pinned memory for MPI sends and receives. We have not made use of
the CUDA v4 (GPU Direct v2) features because of their limitations, in
particular on dual-IOH chipsets, which almost all of the dual-socket machines
people are building right now have - since everyone wants to put 4 or more
GPUs in a node. Once DMA GPU-to-GPU and GPU-to-IB transfers are fully
supported, so that the code does not have to be overly complicated and
fragile to deal with all the exceptions and various system configurations,
we plan to fully exploit them. That will help somewhat with the parallel
scaling. Ultimately, though, the GPUs are simply starved of interconnect
bandwidth. If we had made the initial single-GPU performance very poor we
would be able to show great scaling, but that is the typical 'FloPy' HPC
metric approach that drives me crazy!

PCIe Gen3 and FDR IB should help with things, as long as people don't go
putting 4 or 8 GPUs in a node with a single IB adapter and expect miracles.
We also need a good multi-GPU FFT implementation: at the moment the FFT is
done entirely on GPU 0 and takes about 1/7th of the simulation time, which
on its own limits the scaling to a maximum of around 8 GPUs.
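
To put a number on that, here is a quick Amdahl's-law style estimate in
plain Python; the 1/7 figure quoted above is the only input taken from the
message, everything else is just illustrative:

    # Rough Amdahl's law estimate of the scaling ceiling imposed by an
    # FFT that runs only on GPU 0 and takes ~1/7th of the step time.

    def ideal_speedup(serial_fraction, n_gpus):
        """Speedup over 1 GPU if everything except the serial fraction
        parallelizes perfectly."""
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_gpus)

    fft_fraction = 1.0 / 7.0  # FFT confined to GPU 0

    for n in (2, 4, 8, 16):
        print(f"{n:2d} GPUs -> at best {ideal_speedup(fft_fraction, n):.2f}x")

    # The speedup can never exceed 1/fft_fraction = 7x, so returns diminish
    # rapidly past a handful of GPUs until the FFT itself is distributed.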

Our goal is something on the order of half a microsecond a day for the JAC
production benchmark, although how long it takes to get there (and to add
all the extra features we have planned) depends on whether the NSF SI2-SSE
grant that Adrian Roitberg and I have to fund this work, which ends Sept
30th, gets renewed or not.

> Here are some benchmarking figures for the Amber
> PME/Cellulose_production_NPT benchmark on our gpu hardware:
>
> # Benchmarking results
> Conventional hardware, 8 cpus -- 4881s
> Conventional parallel on 16 cpus -- 2679s
>
> Cuda.pmemd, serial -- 961s
> Cuda.pmemd.MPI, 2 gpus -- 694s
> Cuda.pmemd.MPI, 4 gpus -- 524s

It would be useful to see these timings as ns/day numbers. Then they would
be directly comparable with the benchmarks here:
http://ambermd.org/gpus/benchmarks.htm#Benchmarks and we could see if you
are getting the performance you should be.
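
Roughly, ns/day = nstlim * dt (in ns) * 86400 / wall-clock seconds. A
minimal Python sketch of the conversion using your timings above - note the
nstlim and dt values are only placeholders, so substitute the ones from the
benchmark's own mdin file:

    # Convert wall-clock benchmark timings into ns/day.
    # NOTE: nstlim and dt_ps below are placeholders - use the values
    # from the benchmark's mdin file.

    def ns_per_day(wall_seconds, nstlim, dt_ps):
        simulated_ns = nstlim * dt_ps / 1000.0      # ps -> ns
        return simulated_ns * 86400.0 / wall_seconds

    timings_s = {
        "8 CPUs":  4881,
        "16 CPUs": 2679,
        "1 GPU":    961,
        "2 GPUs":   694,
        "4 GPUs":   524,
    }

    for label, seconds in timings_s.items():
        rate = ns_per_day(seconds, nstlim=10000, dt_ps=0.002)
        print(f"{label:8s} {rate:6.2f} ns/day")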

Note, if you have not yet turned off ECC on these GPUs you should, since it
both boosts serial performance AND improves parallel scaling (and gives you
more usable GPU memory to boot) :-)
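
If you want to script that, a minimal sketch driving nvidia-smi from Python
follows; option spellings can differ a little between driver versions (check
nvidia-smi -h on your system), it needs root, and a reboot is required
before the new ECC setting takes effect:

    # Query and disable ECC on every GPU in the node via nvidia-smi.
    import subprocess

    NUM_GPUS = 4  # adjust to your node

    # Show the current ECC mode for all GPUs.
    subprocess.run(["nvidia-smi", "-q", "-d", "ECC"], check=True)

    # Set ECC mode to "off" (0) on each GPU; takes effect after reboot.
    for gpu_id in range(NUM_GPUS):
        subprocess.run(["nvidia-smi", "-i", str(gpu_id), "-e", "0"],
                       check=True)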

All the best
Ross

/\
\/
|\oss Walker

---------------------------------------------------------
| Assistant Research Professor |
| San Diego Supercomputer Center |
| Adjunct Assistant Professor |
| Dept. of Chemistry and Biochemistry |
| University of California San Diego |
| NVIDIA Fellow |
| http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
---------------------------------------------------------

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.




_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber