Re: [AMBER] pmemd.cuda.MPI vs openmpi

From: Victor Ma <victordsmagift.gmail.com>
Date: Wed, 3 Jun 2015 13:29:13 -0700

Hello Ross,

I just heard back from the sysadmin:

"You are correct in that the 2 GPUs are connected to different PCI-E
nodes. This is a deliberate decision in that it balances aggregate I/O
capability between the GPUs and the x86 processors, as there are 2 CPUs.
As you have observed, this has the side-effect of disallowing GPUDirect
communication.

It is still possible to use 2 GPUs but your program must manage memory more
explicitly. Transfers would then come down into host memory, across the
inter-processor QPI channels, and into the second GPU. One possibility is
to use multiple OpenMP threads or MPI ranks, one for each GPU; designing
for the latter would allow your program to scale up to many nodes and GPUs."
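
If I understand him correctly, "managing memory explicitly" would look roughly
like the sketch below: one MPI rank per GPU, with transfers staged through host
memory. (This is just my own illustration using the CUDA runtime, not anything
from pmemd; the buffer size and the two-rank assumption are made up.)

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Assumes exactly 2 ranks: rank 0 drives GPU 0, rank 1 drives GPU 1. */
    cudaSetDevice(rank);

    const int n = 1 << 20;                              /* arbitrary buffer size */
    double *d_buf, *h_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(double));
    cudaMallocHost((void **)&h_buf, n * sizeof(double)); /* pinned host staging buffer */

    /* Without P2P, GPU-to-GPU data must go GPU -> host -> (QPI) -> host -> GPU. */
    cudaMemcpy(h_buf, d_buf, n * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Sendrecv_replace(h_buf, n, MPI_DOUBLE, 1 - rank, 0, 1 - rank, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(d_buf, h_buf, n * sizeof(double), cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    MPI_Finalize();
    return 0;
}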

I thought we were already running one MPI rank per GPU, since I run:
mpirun -np 2 pmemd.cuda.MPI -O ...

Anyway thank you so much!

Victor



On Wed, Jun 3, 2015 at 12:29 PM, Ross Walker <ross.rosswalker.co.uk> wrote:

> Hi Victor,
>
> Indeed the two different buses are the problem. Hopefully you can convince
> the sysadmin to physically move one of the K20s to a different slot. This
> may or may not be possible depending on the design of the underlying nodes.
> I note they are K20m cards, which are passively cooled; that likely means a
> ducted node designed more for space (e.g. 1U or 1/2U), with proprietary
> motherboards and cases, than for performance. But you might be lucky: if
> these are 2U / 4-GPU-capable nodes that were just not maxed out on GPUs,
> there is space to move one of the GPUs to the other bus - although these
> are HP nodes so I wouldn't bet on it. :-(
>
> With regard to compilation - try compiling on the login node. This
> should have $CUDA_HOME defined, which is how the makefile knows where to
> find the nvcc compiler. Alternatively you may need to load the correct
> module for the CUDA compilers, or point CUDA_HOME to /usr/local/cuda - it
> depends on how your system was set up. Either way your sysadmin should know
> how to compile check_p2p and can try it on your nodes.
>
> All the best
> Ross
>
> > On Jun 3, 2015, at 11:58 AM, Victor Ma <victordsmagift.gmail.com> wrote:
> >
> > Hello Ross,
> >
> > Thank you so much for the detailed explanation. I think I know what the
> > problem is. My command to run 2 GPUs on a single node is right:
> > export CUDA_VISIBLE_DEVICES=0,1
> > mpirun -np 2 pmemd.cuda.MPI -O ...
> >
> > When I run make for check_p2p, the error message is:
> > /bin/nvcc -ccbin g++
> > -I/home/rcf-proj2/zz1/zhen009/membrane/amber/prep/openmpi-1/check_p2p
> > -m64 -o gpuP2PCheck.o -c gpuP2PCheck.cu
> > make: /bin/nvcc: Command not found
> > make: *** [gpuP2PCheck.o] Error 127
> >
> > I suppose nvcc is indeed not installed on the cluster, or at least not
> > under /bin/nvcc.
> >
> > And your guess is right: the two GPUs are on two different buses:
> > lspci -t -v
> > -+-[0000:20]-+-00.0-[31]--
> > | +-01.0-[21]--
> > | +-01.1-[2a]--
> > | +-02.0-[24]----00.0 NVIDIA Corporation GK110GL [Tesla K20m]
> > | +-02.1-[2b]--
> > | +-02.2-[2c]--
> > | +-02.3-[2d]--
> > | +-03.0-[27]--
> > | +-03.1-[2e]--
> > | +-03.2-[2f]--
> > | +-03.3-[30]--
> > | +-04.0 Intel Corporation Xeon E5/Core i7 DMA Channel 0
> > | +-04.1 Intel Corporation Xeon E5/Core i7 DMA Channel 1
> > | +-04.2 Intel Corporation Xeon E5/Core i7 DMA Channel 2
> > | +-04.3 Intel Corporation Xeon E5/Core i7 DMA Channel 3
> > | +-04.4 Intel Corporation Xeon E5/Core i7 DMA Channel 4
> > | +-04.5 Intel Corporation Xeon E5/Core i7 DMA Channel 5
> > | +-04.6 Intel Corporation Xeon E5/Core i7 DMA Channel 6
> > | +-04.7 Intel Corporation Xeon E5/Core i7 DMA Channel 7
> > | +-05.0 Intel Corporation Xeon E5/Core i7 Address Map,
> > VTd_Misc, System Management
> > | +-05.2 Intel Corporation Xeon E5/Core i7 Control Status and
> > Global Errors
> > | \-05.4 Intel Corporation Xeon E5/Core i7 I/O APIC
> > \-[0000:00]-+-00.0 Intel Corporation Xeon E5/Core i7 DMI2
> > +-01.0-[05]----00.0 LSI Logic / Symbios Logic SAS2308
> > PCI-Express Fusion-MPT SAS-2
> > +-01.1-[06]--
> > +-02.0-[08]----00.0 NVIDIA Corporation GK110GL [Tesla K20m]
> > +-02.1-[0c]--
> > +-02.2-[0b]--
> > +-02.3-[0d]--
> > +-03.0-[07]----00.0 Mellanox Technologies MT27500 Family
> > [ConnectX-3]
> > +-03.1-[0e]--
> > +-03.2-[0f]--
> > +-03.3-[10]--
> > +-04.0 Intel Corporation Xeon E5/Core i7 DMA Channel 0
> > +-04.1 Intel Corporation Xeon E5/Core i7 DMA Channel 1
> > +-04.2 Intel Corporation Xeon E5/Core i7 DMA Channel 2
> > +-04.3 Intel Corporation Xeon E5/Core i7 DMA Channel 3
> > +-04.4 Intel Corporation Xeon E5/Core i7 DMA Channel 4
> > +-04.5 Intel Corporation Xeon E5/Core i7 DMA Channel 5
> > +-04.6 Intel Corporation Xeon E5/Core i7 DMA Channel 6
> > +-04.7 Intel Corporation Xeon E5/Core i7 DMA Channel 7
> > +-05.0 Intel Corporation Xeon E5/Core i7 Address Map,
> > VTd_Misc, System Management
> > +-05.2 Intel Corporation Xeon E5/Core i7 Control Status and
> > Global Errors
> > +-05.4 Intel Corporation Xeon E5/Core i7 I/O APIC
> > +-11.0-[04]--
> > +-1a.0 Intel Corporation C600/X79 series chipset USB2
> > Enhanced Host Controller #2
> > +-1c.0-[02]--+-00.0 Intel Corporation I350 Gigabit Network
> > Connection
> > | \-00.1 Intel Corporation I350 Gigabit Network
> > Connection
> > +-1c.7-[01]--+-00.0 Hewlett-Packard Company Integrated
> > Lights-Out Standard Slave Instrumentation & System Support
> > | +-00.1 Matrox Electronics Systems Ltd. MGA
> G200EH
> > | +-00.2 Hewlett-Packard Company Integrated
> > Lights-Out Standard Management Processor Support and Messaging
> > | \-00.4 Hewlett-Packard Company Integrated
> > Lights-Out Standard Virtual USB Controller
> > +-1d.0 Intel Corporation C600/X79 series chipset USB2
> > Enhanced Host Controller #1
> > +-1e.0-[03]--
> > +-1f.0 Intel Corporation C600/X79 series chipset LPC
> > Controller
> > \-1f.2 Intel Corporation C600/X79 series chipset 6-Port SATA
> > AHCI Controller
> >
> > I will let the system admin know and hope they might do something. :(
> >
> > Thanks again - I really appreciate it.
> >
> > Victor
> >
> >
> > On Wed, Jun 3, 2015 at 11:31 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
> >
> >> Hi Victor,
> >>
> >> Do not attempt to run regular GPU MD runs across multiple nodes.
> >> InfiniBand is far too slow these days to keep up with the computation
> >> speed of the GPUs. The only simulations you can run over multiple nodes
> >> with GPUs are loosely coupled runs, such as those based on replica
> >> exchange approaches.
> >>
> >> In terms of using more than one GPU within a node for a single MD run,
> >> it is crucial that they can communicate via peer to peer over the PCI-E
> >> bus. Having to go through the CPU chipset (which is what happens when
> >> they can't talk via peer to peer) is also too slow these days. In terms
> >> of CPU counts for multi-GPU runs, the CPU is used purely to control the
> >> GPU, so running with -np 16 does not help - it actually runs 16 GPU
> >> 'instances', which end up as 8 on each of your GPUs and really slows
> >> things down. We could have taken the NAMD / GROMACS approach of only
> >> offloading part of the calculation to the GPU and using the CPUs for the
> >> remainder, but the net result is that you end up slower overall than
> >> just taking the 'everything on the GPU' approach and leaving the excess
> >> CPUs idle. That said, you can use those CPUs for other jobs, e.g.:
> >>
> >> export CUDA_VISIBLE_DEVICES=0
> >> nohup $AMBERHOME/bin/pmemd.cuda -O -i mdin.0 -o mdout.0 ... &
> >> export CUDA_VISIBLE_DEVICES=1
> >> nohup $AMBERHOME/bin/pmemd.cuda -O -i mdin.1 -o mdout.1 ... &
> >> nohup mpirun -np 14 $AMBERHOME/bin/pmemd.MPI -O -i mdin.2 -o mdout.2 ... &
> >>
> >> So the CPUs are not entirely wasted - although this takes a carefully
> >> crafted scheduler on a cluster.
> >>
> >> In terms of using the 2 GPUs at the same time, the correct command line
> >> for your 2-GPU case is:
> >>
> >> export CUDA_VISIBLE_DEVICES=0,1
> >> mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O ...
> >>
> >> The issue is that without P2P it is impossible to get a speedup (for
> >> non-GB calculations) over multiple GPUs. In this case the best you can
> >> do is run two separate runs, one on each GPU, as above.
> >>
> >> Is there a reason you cannot build the check_p2p code? It's really
> >> simple - I'd be shocked if the cluster did not have make and nvcc
> >> installed. How would anyone compile their code for it? How did they
> >> compile AMBER 14?
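> >>
> >> For reference, the essence of such a check is a single CUDA runtime query
> >> per device pair. A bare-bones sketch along these lines (not the actual
> >> gpuP2PCheck.cu source - just an illustration) reports which pairs support
> >> P2P:
> >>
> >> #include <cstdio>
> >> #include <cuda_runtime.h>
> >>
> >> int main()
> >> {
> >>     int ndev = 0;
> >>     cudaGetDeviceCount(&ndev);
> >>     for (int i = 0; i < ndev; ++i) {
> >>         for (int j = 0; j < ndev; ++j) {
> >>             if (i == j) continue;
> >>             int ok = 0;
> >>             /* Can device i directly access memory on device j? */
> >>             cudaDeviceCanAccessPeer(&ok, i, j);
> >>             printf("GPU %d -> GPU %d : P2P %s\n", i, j, ok ? "YES" : "NO");
> >>         }
> >>     }
> >>     return 0;
> >> }
> >>
> >> (Compile with something like: nvcc -o p2p_check p2p_check.cu)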
> >>
> >> One thing you can quickly try is running lspci | grep NVIDIA on one of
> >> the nodes. E.g.:
> >>
> >> [root.GTX_TD ~]# lspci | grep NVIDIA
> >> 02:00.0 VGA compatible controller: NVIDIA Corporation GM204 (rev a1)
> >> 02:00.1 Audio device: NVIDIA Corporation Device 0fbb (rev a1)
> >> 03:00.0 VGA compatible controller: NVIDIA Corporation GM204 (rev a1)
> >> 03:00.1 Audio device: NVIDIA Corporation Device 0fbb (rev a1)
> >> 82:00.0 VGA compatible controller: NVIDIA Corporation GM204 (rev a1)
> >> 82:00.1 Audio device: NVIDIA Corporation Device 0fbb (rev a1)
> >> 83:00.0 VGA compatible controller: NVIDIA Corporation GM204 (rev a1)
> >> 83:00.1 Audio device: NVIDIA Corporation Device 0fbb (rev a1)
> >>
> >> Here you get the bus numbers that the GPUs are connected to. In this
> >> case there are 4 GPUs: one on bus 02, one on bus 03, one on bus 82 and
> >> one on bus 83. You can then run 'lspci -t -v' to get a full bus
> >> connectivity listing. In this case (pulling out the bits relevant to the
> >> GPUs) we have:
> >>
> >> +-[0000:80]-+-00.0-[81]--+-00.0 Intel Corporation I350 Gigabit Network
> >> Connection
> >> | | \-00.1 Intel Corporation I350 Gigabit Network
> >> Connection
> >> | +-02.0-[82]--+-00.0 NVIDIA Corporation GM204
> >> | | \-00.1 NVIDIA Corporation Device 0fbb
> >> | +-03.0-[83]--+-00.0 NVIDIA Corporation GM204
> >> | | \-00.1 NVIDIA Corporation Device 0fbb
> >> | +-04.0 Intel Corporation Xeon E5 v3/Core i7 DMA Channel 0
> >>
> >> and
> >>
> >> \-[0000:00]-+-00.0 Intel Corporation Xeon E5 v3/Core i7 DMI2
> >> +-01.0-[01]--
> >> +-02.0-[02]--+-00.0 NVIDIA Corporation GM204
> >> | \-00.1 NVIDIA Corporation Device 0fbb
> >> +-03.0-[03]--+-00.0 NVIDIA Corporation GM204
> >> | \-00.1 NVIDIA Corporation Device 0fbb
> >> +-04.0 Intel Corporation Xeon E5 v3/Core i7 DMA Channel 0
> >>
> >>
> >> So you see here that the 4 GPUs are in two groups: one set of two on one
> >> bus (connected to one of the CPU sockets) and the other set of two on the
> >> other bus, connected to the other CPU socket. GPUs here can only
> >> communicate via P2P if they are on the same PCI bus. So GPUs 0 and 1 can
> >> do P2P, and 2 and 3 can do P2P, but the combinations 0-2, 0-3, 1-2 and
> >> 1-3 are not supported.
> >>
> >> In the case of your system I suspect that they placed one GPU on one bus
> >> and one GPU on the other bus - this is about the worst combination you
> >> can make for having two GPUs in the same node. If this is the case then
> >> you need to ask the administrators to please physically move one of the
> >> GPUs to a different PCI-E slot so that both are connected to the same
> >> physical CPU socket.
> >>
> >> Confusing and annoying but unfortunately a complexity that most people
> >> building clusters these days don't consider.
> >>
> >> Hope that helps.
> >>
> >> All the best
> >> Ross
> >>
> >>> On Jun 3, 2015, at 10:54 AM, Victor Ma <victordsmagift.gmail.com> wrote:
> >>>
> >>> Hello Amber community,
> >>>
> >>> I am testing my AMBER 14 on a GPU cluster with IB. I noticed that when
> >>> I turn on OpenMPI with pmemd.cuda.MPI, it can actually slow things down.
> >>> On a single node, I have two GPUs and 16 CPUs. If I submit a job using
> >>> "pmemd.cuda.MPI -O -i .....", one GPU is 99% used and P2P support is on.
> >>> For my big system, I am getting ~27 ns/day. If instead I turn on OpenMPI
> >>> and use "export CUDA_VISIBLE_DEVICES=0,1" then "mpirun -np 2
> >>> pmemd.cuda.MPI -O -i ....", the two GPUs are 77% used each but P2P is
> >>> OFF. In this case, I am getting 33 ns/day. It is faster, but I suspect
> >>> it could be even faster if P2P were on. The other thing I tried is
> >>> "mpirun -np 16 pmemd.cuda.MPI -O -i ....". Here the run slows down to
> >>> 14 ns/day. One GPU is used and all 16 CPUs are used. Again P2P is off.
> >>>
> >>> I downloaded the check_p2p scripts, but as I am working on a cluster I
> >>> could not run "make".
> >>>
> >>> I am pretty happy with the speed I am getting, but I am also wondering
> >>> if the configuration can be further optimized to improve performance,
> >>> e.g. running on 2 GPUs at 100% with P2P on.
> >>>
> >>>
> >>> Thank you!
> >>>
> >>>
> >>> Victor
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Jun 03 2015 - 13:30:02 PDT