Re: [AMBER] performance of pmed.cuda.MPI

From: Jonathan Gough <jonathan.d.gough.gmail.com>
Date: Fri, 21 Sep 2012 13:09:11 -0400

On a slightly related note:

When running GPU calculations, has anyone looked at, or have an idea of, the
effect that CPU speed has? For example, if you were to compare machines that
were otherwise identical (same motherboard, hard drive, RAM, and a 680X) but
had different CPUs, would you see a performance bump/drop, or is it still a
level playing field? Can you get by with an i3 or i5 instead of a top-end i7?
(That could be another way to save money if you're building on a budget.)

Any thoughts or insight?

Thanks,
Jonathan



On Fri, Sep 21, 2012 at 12:56 PM, Ross Walker <ross.rosswalker.co.uk> wrote:

> Hi Tru,
>
>
> >> MPI performance of the GTX 690 is abysmal because the two GPUs share
> >> the same PCI-E adapter.
> >>
> >> That will improve down the road somewhat.
> >>
> >> In the meantime, I think you'll be happy with the performance of two
> >> independent runs (one on each GPU): 98+% efficiency when I last
> >> checked...
> >
> >If I understand you correctly, with a motherboard with 4 PCI-E slots and
> >4x GTX 690, one should run 8 independent pmemd.cuda (non-MPI) jobs to get
> >the maximum throughput.
>
> Yes, with caveats.
>
> Firstly, when you run single-GPU AMBER the entire calculation is run on
> the GPU and communication over the PCI-E bus only occurs for I/O. Thus if
> you set NTPR and NTWX high enough, typically >= 1000, the PCI-E speed
> will have little impact on performance. Additionally, as long as you
> select the GPUs correctly, so that each individual job runs on a
> different physical GPU, the performance impact of each successive
> single-GPU job will be minimal. Any performance decrease occurs mostly
> because of contention for I/O resources. Thus PCI-E x8 is reasonable for
> single-GPU jobs. x4 might be cutting it too fine but would still work
> reasonably well if you don't do I/O too frequently.
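>
> As a rough illustration, here's a minimal sketch (in Python, just to show
> the idea) of launching one independent pmemd.cuda run per physical GPU,
> pinning each job with CUDA_VISIBLE_DEVICES. It assumes pmemd.cuda is on
> your PATH and that hypothetical directories run0/, run1/, ... each hold
> an mdin (with NTPR/NTWX >= 1000), prmtop and inpcrd; adjust the names and
> the GPU count to your own setup.
>
>     # launch_independent.py: one pmemd.cuda run per GPU (hypothetical layout)
>     import os
>     import subprocess
>
>     NUM_GPUS = 2  # e.g. one GTX 690 exposes two devices, 0 and 1
>
>     procs = []
>     for gpu in range(NUM_GPUS):
>         # Pin this job to a single physical GPU.
>         env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
>         procs.append(subprocess.Popen(
>             ["pmemd.cuda", "-O",
>              "-i", "mdin", "-p", "prmtop", "-c", "inpcrd",
>              "-o", "mdout", "-r", "restrt", "-x", "mdcrd"],
>             cwd="run%d" % gpu,  # one working directory per job
>             env=env))
>
>     for p in procs:
>         p.wait()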
>
> Now on to parallel jobs. When you want to run a job across multiple GPUs,
> information has to be sent between the GPUs on every time step,
> irrespective of whether I/O is being done. This makes the PCI-E bandwidth
> critical and a major bottleneck. x16 was marginal back in the days of the
> C1060 / C2050 cards. Now we have cards that are almost double the speed
> and we are still at x16 - clearly a fail! Then it gets worse: with the
> K10 and GTX 690 there are two GPUs on the same board, although for all
> intents and purposes they are really two distinct GPUs that are
> essentially jammed into the same PCI-E slot. The bandwidth for each GPU
> is thus x8, which is woefully inadequate for running AMBER in parallel
> across multiple GPUs. When you use both GPUs on a single K10 or GTX 690
> they still share the PCI-E bus, so it is like having two cards each in an
> x8 slot; hence it doesn't help in parallel. If there were a 'real'
> interconnect between the two GPUs it would be interesting, but there
> isn't; they are just two GPUs, each on half of the PCI-E connection. The
> K10s scale a little better than the GTX 690s, but that's just because the
> GPUs themselves are slower and so the performance-to-bandwidth ratio is a
> little better. If you measure absolute performance, though, there is no
> free lunch.
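>
> If you want to check what link width each GPU actually gets, something
> like the following sketch works, assuming you have the pynvml package
> (the NVML Python bindings) installed; each half of a GTX 690 should
> report x8 while a GTX 680 in a full slot reports x16.
>
>     # pcie_width.py: report current/max PCI-E link width per GPU
>     import pynvml
>
>     pynvml.nvmlInit()
>     try:
>         for i in range(pynvml.nvmlDeviceGetCount()):
>             h = pynvml.nvmlDeviceGetHandleByIndex(i)
>             name = pynvml.nvmlDeviceGetName(h)
>             if isinstance(name, bytes):
>                 name = name.decode()
>             cur = pynvml.nvmlDeviceGetCurrPcieLinkWidth(h)
>             mx = pynvml.nvmlDeviceGetMaxPcieLinkWidth(h)
>             print("GPU %d (%s): PCI-E x%d (max x%d)" % (i, name, cur, mx))
>     finally:
>         pynvml.nvmlShutdown()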
>
> Now on to putting 4 GTX 690s in the same box. I have not tried this and I
> don't know of any vendor selling them. 4 x GTX 680 is no problem. The
> issue with 4 x 690s is that you have 8 physical GPUs per box, and there
> are VERY few motherboards with BIOSes that can support 8 physical GPUs.
> The Tyan 8-way M2090 systems took a LOT of work to get working, including
> substantial hacking of the BIOS. The issue is that the legacy I/O address
> space is just 64K (a hard limit imposed by the x86 architecture). Each
> GPU uses around 4K of I/O space, so 8 GPUs need half of the total I/O
> space, which assumes everything else on the motherboard (NIC cards, hard
> drive controllers, etc.) is being very economical and well behaved. On
> consumer boards this is unlikely, so I'd be very surprised if you can get
> 4 GTX 690s working in a regular board. You probably need to go for
> specialist multi-socket Supermicro or Tyan boards, which can be VERY
> expensive (not to mention the CPU costs). So it is generally much more
> cost effective to build 2 nodes with 2 GTX 690s each. You might be able
> to get away with 3 GTX 690s in one board, although I don't know anybody
> who has tried it, and it will run VERY hot.
>
> Power: you probably need 2 x 1.2 kW independent power supplies for 4
> GTX 690s as well, which will make the case expensive.
>
> >
> >The GTX 690 is seen as 2 NVIDIA devices that are addressed independently?
>
> Yes, for all intents and purposes consider them to be 2 physical GPUs
> jammed into a single PCI-E x16 slot, sharing the fan.
>
> >In order to get better pmemd.cuda.MPI scaling, does one need to target
> >only one of the 2 GPUs on each PCI-E slot for each run? How does that
> >behave for independent pmemd.cuda.MPI simulations? Does the shared PCI-E
> >become the bottleneck? Bottom line: are multiple GTX 690s in the same
> >server worth it, or should one stay with the regular GTX 680?
>
> Using only one of the two GPUs on each GTX 690 board can help. E.g. if
> you use one GPU from each of two boards then they will get x16 bandwidth
> each and the parallel scaling will improve, but you will be leaving half
> the GPUs idle. You can't run 2 x 2-GPU jobs split over two boards, since
> this puts you back at x8 bandwidth for each GPU. The GTX 680s don't scale
> very well in parallel because they are so damn fast individually that the
> PCI-E x16 bandwidth can't keep up. Hence, until x32 becomes ubiquitous
> and can feed all GPUs at that speed, it is better to focus on single-GPU
> runs, in which case it is a close call between GTX 690s and 680s: the
> 690s are about twice the price of the 680s, so you don't get any extra
> hardware for free. Thus if you can get 690s at a discount compared to
> 2 x 680s then it is probably worth going with the 690s, unless you have
> other constraints like space etc. Given that nobody should be running
> single MD runs these days and trying to draw conclusions from them (in
> the lab we work with moles of molecules, remember), it isn't a problem
> that with 4 GPUs in a machine the optimum way to run them is 4
> independent MD simulations.
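>
> If you do want a single 2-GPU pmemd.cuda.MPI run using one GPU from each
> of two boards, a minimal sketch looks like the following. It assumes
> mpirun and pmemd.cuda.MPI are on your PATH, and that device indices 0 and
> 2 happen to sit on different physical boards - the enumeration order
> varies, so check with deviceQuery or nvidia-smi first.
>
>     # mpi_two_boards.py: 2-GPU run, one GPU per GTX 690 board
>     # (device indices 0 and 2 are hypothetical - verify on your system)
>     import os
>     import subprocess
>
>     env = dict(os.environ, CUDA_VISIBLE_DEVICES="0,2")
>     subprocess.check_call(
>         ["mpirun", "-np", "2", "pmemd.cuda.MPI", "-O",
>          "-i", "mdin", "-p", "prmtop", "-c", "inpcrd",
>          "-o", "mdout", "-r", "restrt", "-x", "mdcrd"],
>         env=env)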
>
> Here's an example hardware shopping list for building your own 2 x GTX 680
> or 2 x GTX 690 machines. We have built several of these and they work great.
>
> http://www.rosswalker.co.uk/current_amber_gpu_spec.htm
>
> Hope that helps.
>
> All the best
> Ross
>
> /\
> \/
> |\oss Walker
>
> ---------------------------------------------------------
> | Assistant Research Professor |
> | San Diego Supercomputer Center |
> | Adjunct Assistant Professor |
> | Dept. of Chemistry and Biochemistry |
> | University of California San Diego |
> | NVIDIA Fellow |
> | http://www.rosswalker.co.uk | http://www.wmd-lab.org |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> ---------------------------------------------------------
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> be read every day, and should not be used for urgent or sensitive issues.
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Sep 21 2012 - 10:30:02 PDT