Re: [AMBER] Amber 14 Performance and Other Questions from Ross Walker on 2015-01-12 (Amber Archive Jan 2015)

From: Ross Walker <ross.rosswalker.co.uk>
Date: Mon, 12 Jan 2015 11:53:46 -0800

> On Jan 12, 2015, at 11:11 AM, Novosielski, Ryan <novosirj.ca.rutgers.edu> wrote:
>
> Thanks. That page did contain the answer, which was the following:
>
> "In parallel considerations change to the available bandwidth in the node (attempting to run across nodes is not recommended). With AMBER 14 the ideal specification for performance is 2 or 4 GPUs per node all in PCI-E Gen 3 x16 slots (or better). AMBER 14 uses peer to peer communication to provide optimum multi-GPU scaling. At the time of writing no motherboards exist that support more than two way peer to peer (but we have a unique custom-built system from CirraScale that supports 4-way simulations).”
>
> …my understanding is that that is out of date though as there appear to be custom machines available that do 8-way peer-to-peer.
>

Not that support non-blocking x16 bandwidth. For example, the 8-way boxes that Exxact and Colfax offer have four x16 channels, two from each CPU. Each of those x16 channels feeds a 48-lane PCI-E switch, off of which hang two x16 slots. So if you put 8 cards in such a box you have the following:

                             -- x16 -- GPU 0
      -- x16 -- PLX switch --
                             -- x16 -- GPU 1
CPU 1
                             -- x16 -- GPU 2
      -- x16 -- PLX switch --
                             -- x16 -- GPU 3

                             -- x16 -- GPU 4
      -- x16 -- PLX switch --
                             -- x16 -- GPU 5
CPU 2
                             -- x16 -- GPU 6
      -- x16 -- PLX switch --
                             -- x16 -- GPU 7

So, GPU 0 & 1, 2 & 3, 4 & 5, 6 & 7 can talk to each other via peer to peer at full speed here.

GPUs on the same CPU can also talk via peer to peer. For example, GPU 4 can talk to GPU 7 via peer to peer, but it only gets x16 speed if GPU 5 and GPU 6 are not talking at the same time. If they are, then there is effectively x8 speed between the two banks of GPUs.

GPUs 0, 1, 2 and 3 cannot talk to GPUs 4, 5, 6 or 7 here without going through the CPU chipset, which kills peer to peer and hurts both bandwidth and latency.
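
As a rough illustration of the three cases above, here is a minimal Python sketch that classifies the path between any two GPUs in the box drawn above. It assumes exactly the layout in the diagram (two GPUs per PLX switch, two switches per CPU, GPUs numbered 0 to 7 in that order). The function and constant names are made up for this example and are not part of AMBER or any vendor tool.

```python
# Hypothetical model of the 8-GPU layout in the diagram: two GPUs per PLX
# switch and two switches per CPU. Names and constants are illustrative only.
GPUS_PER_SWITCH = 2
SWITCHES_PER_CPU = 2
GPUS_PER_CPU = GPUS_PER_SWITCH * SWITCHES_PER_CPU
TOTAL_GPUS = 8


def p2p_path(a: int, b: int) -> str:
    """Classify the route between two distinct GPU indices (0 to 7)."""
    if a == b:
        raise ValueError("need two different GPUs")
    if not (0 <= a < TOTAL_GPUS and 0 <= b < TOTAL_GPUS):
        raise ValueError("GPU index out of range for this layout")
    if a // GPUS_PER_SWITCH == b // GPUS_PER_SWITCH:
        # Both cards hang off the same PLX switch.
        return "same switch: peer to peer at full x16"
    if a // GPUS_PER_CPU == b // GPUS_PER_CPU:
        # Different switches on one CPU: traffic shares the x16 uplinks,
        # so concurrent transfers effectively see about x8 each.
        return "same CPU: peer to peer, x16 uplink shared"
    # Crossing between the two CPUs goes through the chipset.
    return "different CPUs: no peer to peer"


if __name__ == "__main__":
    for pair in [(0, 1), (4, 7), (0, 4)]:
        print(pair, "->", p2p_path(*pair))
```

On a real machine the numbering reported by the driver may not match the physical slot order, so treat the index arithmetic as an assumption to check against the actual topology.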

If you know of any systems that support non-blocking peer to peer across 4 GPUs, other than the CirraScale system, which uses an 80-lane PLX switch, then I'd love to hear about it.

> I am curious what the reason is for the second process that runs on the GPU on Amber 14 (eg. if you run a 2 GPU MPI job, you will get 2 GPU processes per GPU, for a total of 4).

Huh? What do you mean here? - There should be just one process per GPU. If there are more then you are doing something wrong.
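
One rough way to check the process count per GPU is to tally the compute processes that nvidia-smi reports. The sketch below assumes you have captured the output of `nvidia-smi --query-compute-apps=gpu_uuid,pid --format=csv,noheader` as text. The helper name and the sample UUIDs and PIDs are made up for illustration.

```python
from collections import Counter


def processes_per_gpu(csv_text: str) -> Counter:
    """Count compute processes per GPU UUID from 'gpu_uuid, pid' CSV lines."""
    counts = Counter()
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        gpu_uuid, _pid = (field.strip() for field in line.split(",", 1))
        counts[gpu_uuid] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample resembling a 2 GPU job where each GPU shows 2 processes.
    sample = (
        "GPU-aaaa, 1001\n"
        "GPU-aaaa, 1002\n"
        "GPU-bbbb, 1001\n"
        "GPU-bbbb, 1002\n"
    )
    for uuid, n in sorted(processes_per_gpu(sample).items()):
        print(uuid, n)
```

Any GPU with a count above one in this tally is showing more than one process.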

All the best
Ross

Received on Mon Jan 12 2015 - 12:00:04 PST