Re: [AMBER] performance of pmemd.cuda.MPI

From: Ross Walker <ross.rosswalker.co.uk>
Date: Fri, 21 Sep 2012 09:56:27 -0700

Hi Tru,


>> MPI performance of the GTX 690 is abysmal because the two GPUs share
>> the same PCI-E adaptor.
>>
>> That will improve down the road somewhat.
>>
>> In the meantime, I think you'll be happy with the performance of two
>> independent runs (one on each GPU): 98+% efficiency when I last
>> checked...
>
>If I understand you correctly, with a motherboard with 4 PCI-E slots and
>4x GTX 690, one should run 8 independent pmemd.cuda (non-MPI) jobs to get
>the maximum throughput.

Yes, with caveats.

Firstly, when you run single-GPU AMBER the entire calculation runs on the
GPU, and communication over the PCI-E bus occurs only for I/O. Thus if you
set NTPR and NTWX high enough, typically >= 1000, the PCI-E speed will
have little impact on performance. Additionally, as long as you select the
GPUs correctly, so that each individual job runs on a different physical
GPU, the performance impact of each successive single-GPU job will be
minimal. What slowdown there is comes mostly from contention for I/O
resources. Thus PCI-E x8 is reasonable for single-GPU jobs; x4 might be
cutting it too fine but would still work reasonably well if you don't do
I/O too frequently.
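
As a concrete sketch (the directory layout and file names here are
placeholders of my choosing, not anything AMBER prescribes), pinning each
job to its own physical GPU and keeping I/O infrequent looks like this:

    # In each job's mdin, keep I/O infrequent, e.g.:
    #   &cntrl
    #     ...
    #     ntpr=1000, ntwx=1000,
    #   /

    # One independent job per physical GPU; CUDA_VISIBLE_DEVICES pins each
    # process to a single device so the jobs never share a GPU.
    cd run0 && CUDA_VISIBLE_DEVICES=0 nohup $AMBERHOME/bin/pmemd.cuda -O \
        -i mdin -o mdout -p prmtop -c inpcrd -r restrt -x mdcrd &
    cd ../run1 && CUDA_VISIBLE_DEVICES=1 nohup $AMBERHOME/bin/pmemd.cuda -O \
        -i mdin -o mdout -p prmtop -c inpcrd -r restrt -x mdcrd &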

Now on to parallel jobs. When you run a job across multiple GPUs,
information must be sent between the GPUs on every time step, irrespective
of whether I/O is being done. This makes the PCI-E bandwidth critical and
a major bottleneck. x16 was already marginal back in the days of the C1060
/ C2050 cards; now we have cards that are almost double the speed and we
are still at x16 - clearly a losing proposition. Then it gets worse: the
K10 and GTX690 put two GPUs on the same board, but for all intents and
purposes these are two distinct GPUs jammed into the same PCI-E slot. Each
GPU therefore gets x8 bandwidth, which is woefully inadequate for running
AMBER in parallel across multiple GPUs. When you use both GPUs on a single
K10 or GTX690 they still share the PCI-E bus, so it is like having two
cards in two x8 slots; it doesn't help in parallel. If there were a 'real'
interconnect between the two GPUs it would be interesting, but there
isn't - they are just two GPUs, each on half of the PCI-E connection. The
K10s scale a little better than the GTX690s, but only because the
individual GPUs are slower, so the performance-to-bandwidth ratio is a
little better. Measured in absolute performance, there is no free lunch.
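
If you want to see the sharing effect for yourself, the bandwidthTest
sample that ships with the NVIDIA CUDA SDK reports the host-to-device
bandwidth each device actually achieves (the SDK path below is an
assumption; adjust for your install):

    # Per-GPU PCI-E bandwidth check (SDK sample location is an assumption).
    cd ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release
    ./bandwidthTest --device=0 --memory=pinned   # one half of a GTX690
    ./bandwidthTest --device=1 --memory=pinned   # the other half

Run the two tests one at a time and then simultaneously; on a 690 the
simultaneous numbers should drop to roughly half, which is exactly the
x8-per-GPU sharing described above.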

Now on to putting 4 GTX690s in the same box. I have not tried this and I
don't know of any vendor selling such a machine; 4 x GTX680 is no problem.
The issue with 4 x 690s is that you then have 8 physical GPUs per box, and
there are VERY few motherboards whose BIOSes can support 8 physical GPUs.
The Tyan 8-way M2090 systems took a LOT of work to get working, including
substantial hacking of the BIOS. The underlying issue is that the legacy
x86 I/O port space is just 64K (a hard limit imposed by the architecture).
Each GPU uses around 4K of that I/O space, so 8 GPUs need half of the
total, which assumes everything else on the motherboard - NIC cards, hard
drive controllers, etc. - is being very economical and well behaved. On
consumer boards this is unlikely, so I'd be very surprised if you could
get 4 GTX690s working in a regular board. You would probably need to go
for specialist multi-socket Supermicro or Tyan boards, which can be VERY
expensive (not to mention the CPU costs). It is generally much more cost
effective to build 2 nodes with 2 GTX690s each. You might be able to get
away with 3 GTX690s in one board, although I don't know anybody who has
tried it and it would run VERY hot.
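
On Linux you can see how much of that 64K I/O port space is already spoken
for; 10de is the NVIDIA PCI vendor ID (the grep pattern just matches the
I/O region lines in lspci's output):

    # Legacy x86 I/O port space spans 0x0000-0xFFFF (64K) in total.
    cat /proc/ioports                        # everything currently allocated
    lspci -v -d 10de: | grep -i "i/o ports"  # regions claimed by NVIDIA devices

PCI-E bridge I/O windows are allocated with 4K granularity, which is
essentially where the roughly-4K-per-GPU figure comes from.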

As for power: you probably need 2 x 1.2 kW independent power supplies for
4 GTX690s as well, which will make the case expensive.

>
>The GTX-690 is seen as 2 NVIDIA devices that are addressed independently?

Yes, for all intents and purposes consider them to be 2 physical GPUs
jammed in a single PCI-E x16 slot sharing the fan.

>In order to get better pmemd.cuda.MPI scaling, does one need to target
>only one of the 2 GPUs on each PCI-E slot for each run? How does that
>behave for independent pmemd.cuda.MPI simulations? Does the shared PCI-E
>become the bottleneck? Bottom line: are multiple GTX-690s in the same
>server worth it, or should one stay with the regular GTX-680?

Using only one of the two GPUs on each GTX690 board can help. E.g. if you
use one GPU from each of two boards, each gets x16 bandwidth and the
parallel scaling improves, but you will be leaving half the GPUs idle. You
can't usefully run 2 x 2-GPU jobs split over two boards, since that keeps
both GPUs on each board busy and puts you back at x8 bandwidth per GPU.
Even the GTX680s don't scale very well in parallel, simply because they
are so damn fast individually that PCI-E x16 bandwidth can't keep up.
Hence, until x32 is ubiquitous and feeding every GPU at that speed, it is
better to focus on single-GPU runs. For single-GPU throughput it is a
close call between GTX690s and GTX680s: a 690 is about twice the price of
a 680, so you don't get any extra hardware for free. If you can get 690s
at a discount compared to two 680s, it is probably worth going with the
690s - unless you have other constraints like space. And given that nobody
should be running a single MD simulation and trying to draw conclusions
from it these days (in the lab we work with moles of molecules, remember),
it is no real drawback that with 4 GPUs in a machine the optimum way to
use them is 4 independent MD simulations.
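
To make the two usage patterns concrete (device numbering is an assumption
here - check yours with the SDK's deviceQuery - and the file names are
placeholders; pmemd.cuda.MPI assigns one visible device per MPI rank):

    # (a) 2-GPU parallel run using one GPU from each of two 690s, so each
    #     rank gets a full x16 link to itself:
    export CUDA_VISIBLE_DEVICES=0,2
    mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin -o mdout \
        -p prmtop -c inpcrd -r restrt -x mdcrd

    # (b) Maximum aggregate throughput: one independent pmemd.cuda run per
    #     physical GPU.
    for gpu in 0 1 2 3; do
      ( cd run$gpu && CUDA_VISIBLE_DEVICES=$gpu nohup \
        $AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop \
        -c inpcrd -r restrt -x mdcrd & )
    done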

Here's an example hardware shopping list for building your own 2 x GTX680
or 2 x GTX690 machines. We have built several of these and they work great.

http://www.rosswalker.co.uk/current_amber_gpu_spec.htm

Hope that helps.

All the best
Ross

/\
\/
|\oss Walker

---------------------------------------------------------
| Assistant Research Professor |
| San Diego Supercomputer Center |
| Adjunct Assistant Professor |
| Dept. of Chemistry and Biochemistry |
| University of California San Diego |
| NVIDIA Fellow |
| http://www.rosswalker.co.uk | http://www.wmd-lab.org |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
---------------------------------------------------------

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Sep 21 2012 - 10:00:02 PDT