Re: [AMBER] pmemd.cuda.MPI vs openmpi

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 3 Jun 2015 11:31:48 -0700

Hi Victor,

Do not attempt to run regular GPU MD runs across multiple nodes. InfiniBand is far too slow these days to keep up with the computation speed of the GPUs. The only types of simulation that can be run over multiple nodes with GPUs are loosely coupled runs, such as those based on replica-exchange approaches.

In terms of using more than one GPU within a node for a single MD run, it is crucial that the GPUs can communicate via peer-to-peer over the PCI-E bus. Having to go through the CPU chipset (which is what happens when they cannot talk via peer-to-peer) is also too slow these days.

In terms of CPU counts for multi-GPU runs, the CPU is used purely to control the GPU, so running with -np 16 does not help - it actually launches 16 GPU 'instances', which end up as 8 on each of your GPUs and really slows things down. We could have taken the NAMD / Gromacs approach of offloading only part of the calculation to the GPU and using the CPUs for the remainder, but the net result is that you end up slower overall than just taking the 'everything on the GPU' approach and leaving the excess CPUs idle. That said, you can use those CPUs for other jobs. E.g.

# GPU job 1, pinned to GPU 0
export CUDA_VISIBLE_DEVICES=0
nohup $AMBERHOME/bin/pmemd.cuda -O -i mdin.0 -o mdout.0 ... &
# GPU job 2, pinned to GPU 1
export CUDA_VISIBLE_DEVICES=1
nohup $AMBERHOME/bin/pmemd.cuda -O -i mdin.1 -o mdout.1 ... &
# CPU-only job on the remaining 14 cores
nohup mpirun -np 14 $AMBERHOME/bin/pmemd.MPI -O -i mdin.2 -o mdout.2 ... &

So the CPUs are not entirely wasted - although this takes a carefully crafted scheduler on a cluster.

In terms of using the 2 GPUs at the same time for a single run, the correct command line for your 2-GPU case is:

export CUDA_VISIBLE_DEVICES=0,1
mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O ...

The issue is that without P2P it is impossible to get a speedup over multiple GPUs (for anything other than GB, i.e. implicit solvent, calculations). In this case the best you can do is run two separate jobs, one on each GPU, as above.

Is there a reason you cannot build the check_p2p code? It's really simple - I'd be shocked if the cluster did not have make and nvcc installed. How else would anyone compile code for it? How did they compile AMBER 14?
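If running make really is not an option, the underlying query is trivial to reproduce by hand. Below is a minimal sketch along the same lines - not the actual check_p2p source, just the standard CUDA runtime call it relies on (cudaDeviceCanAccessPeer) - which compiles with nothing but nvcc:

// p2p_check.cu - minimal sketch of a peer-to-peer capability check.
// Not the AMBER check_p2p program itself; it simply asks the CUDA
// runtime (cudaDeviceCanAccessPeer) about every visible device pair.
// Build:  nvcc -o p2p_check p2p_check.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int ndev = 0;
    if (cudaGetDeviceCount(&ndev) != cudaSuccess || ndev < 2) {
        printf("Fewer than two CUDA devices visible - nothing to check.\n");
        return 1;
    }
    for (int i = 0; i < ndev; ++i) {
        for (int j = 0; j < ndev; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU %d -> GPU %d : P2P %s\n", i, j,
                   canAccess ? "supported" : "NOT supported");
        }
    }
    return 0;
}

Run it on one of the GPU nodes; any pair reported as NOT supported cannot use peer-to-peer in pmemd.cuda.MPI.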

One thing you can quickly try is running lspci | grep NVIDIA on one of the nodes. E.g.

[root.GTX_TD ~]# lspci | grep NVIDIA
02:00.0 VGA compatible controller: NVIDIA Corporation GM204 (rev a1)
02:00.1 Audio device: NVIDIA Corporation Device 0fbb (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation GM204 (rev a1)
03:00.1 Audio device: NVIDIA Corporation Device 0fbb (rev a1)
82:00.0 VGA compatible controller: NVIDIA Corporation GM204 (rev a1)
82:00.1 Audio device: NVIDIA Corporation Device 0fbb (rev a1)
83:00.0 VGA compatible controller: NVIDIA Corporation GM204 (rev a1)
83:00.1 Audio device: NVIDIA Corporation Device 0fbb (rev a1)

Here you get the bus numbers that the GPUs are connected to. In this case there are 4 GPUs: one on bus 02, one on bus 03, one on bus 82 and one on bus 83. You can then run 'lspci -t -v' to get a full bus connectivity listing. In this case (pulling out the bits relevant to the GPUs) we have:

 +-[0000:80]-+-00.0-[81]--+-00.0  Intel Corporation I350 Gigabit Network Connection
 |           |            \-00.1  Intel Corporation I350 Gigabit Network Connection
 |           +-02.0-[82]--+-00.0  NVIDIA Corporation GM204
 |           |            \-00.1  NVIDIA Corporation Device 0fbb
 |           +-03.0-[83]--+-00.0  NVIDIA Corporation GM204
 |           |            \-00.1  NVIDIA Corporation Device 0fbb
 |           +-04.0  Intel Corporation Xeon E5 v3/Core i7 DMA Channel 0

and

 \-[0000:00]-+-00.0  Intel Corporation Xeon E5 v3/Core i7 DMI2
             +-01.0-[01]--
             +-02.0-[02]--+-00.0  NVIDIA Corporation GM204
             |            \-00.1  NVIDIA Corporation Device 0fbb
             +-03.0-[03]--+-00.0  NVIDIA Corporation GM204
             |            \-00.1  NVIDIA Corporation Device 0fbb
             +-04.0  Intel Corporation Xeon E5 v3/Core i7 DMA Channel 0


So you see here that the 4 GPUs are in two groups: one set of two hanging off the PCI-E root of one CPU socket, and the other set of two hanging off the other socket. GPUs can only communicate via P2P if they sit under the same PCI-E root complex, i.e. are connected to the same physical CPU socket. So here GPUs 0 and 1 can do P2P, and 2 and 3 can do P2P, but the combinations 0-2, 0-3, 1-2 and 1-3 are not supported.
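If digging through the lspci tree is awkward, the same bus numbers can be read back from the CUDA runtime itself. Here is a small sketch (again assuming nvcc is available, and using the standard cudaDeviceGetPCIBusId call) that maps CUDA device numbers to PCI bus IDs:

// busid.cu - print the PCI bus ID of each visible CUDA device so the
// CUDA device numbering can be lined up against the lspci output above.
// Build:  nvcc -o busid busid.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int i = 0; i < ndev; ++i) {
        char busId[32];
        cudaDeviceGetPCIBusId(busId, sizeof(busId), i);  // e.g. "0000:02:00.0"
        printf("CUDA device %d : PCI bus ID %s\n", i, busId);
    }
    return 0;
}

Devices whose bus numbers fall under the same root complex in the lspci -t tree (02/03 or 82/83 in the example above) are the pairs that can potentially do P2P.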

In the case of your system I suspect that they placed one GPU on one bus and one GPU on the other bus - this is about the worst combination you can make for having two GPUs in the same node. If this is the case then you need to ask the administrators to please physically move one of the GPUs to a different PCI-E slot such that they are both connected to the same physical CPU socket.

Confusing and annoying but unfortunately a complexity that most people building clusters these days don't consider.

Hope that helps.

All the best
Ross

> On Jun 3, 2015, at 10:54 AM, Victor Ma <victordsmagift.gmail.com> wrote:
>
> Hello Amber community,
>
> I am testing my amber14 on a gpu cluster with IB. I noticed that when I
> turn on openmpi with pmemd.cuda.MPI, it actually slows things down.
> On single node, I have two gpus and 16 cpus. If I submit a job using
> "pmemd.cuda.MPI -O -i .....", one gpu is 99% used and P2P support is on.
> For my big system, I am getting ~27ns/day. If I turn on openmpi and use
> this instead "export CUDA_VISIBLE_DEVICES=0,1 then mpirun -np 2
> pmemd.cuda.MPI -O -i ....", two gpus are 77% used each but P2P is OFF. In
> this case, I am getting 33 ns/day. It is faster but I suspect that it could
> be even faster if the P2P is on. The other thing I tried is to run "mpirun
> -np 16 pmemd.cuda.MPI -O -i ....". Here the run is slowed down to 14ns/day.
> One GPU is used and all 16 cpus are used. Again p2p is off.
>
> I downloaded the check_p2p scripts. But as I am working on a cluster, I
> could not run "make".
>
> I am pretty happy with the speed I am getting but also wondering if the
> configuration can be further optimized to improve performance, eg running
> on 2gpus 100% with P2P on.
>
>
> Thank you!
>
>
> Victor
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Jun 03 2015 - 12:00:03 PDT