Re: [AMBER] pmemd.cuda.MPI vs openmpi

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 3 Jun 2015 12:29:00 -0700

Hi Victor,

Indeed, the two different buses are the problem. Hopefully you can convince the sysadmin to physically move one of the K20s to a different slot. Whether that is possible depends on the design of the underlying nodes. I note they are K20m cards, which are passively cooled; that usually means a ducted node designed more around space (e.g. 1U or 1/2U) and proprietary motherboards and cases than around actual performance. You might be lucky and these are in fact 2U / 4-GPU-capable nodes that simply were not maxed out on GPUs, in which case there is room to move one of the GPUs to the other bus - although these are HP nodes so I wouldn't bet on it. :-(
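
Incidentally, if the driver on those nodes is new enough, nvidia-smi can report the PCI-E topology directly, which is a quick way to confirm the layout without building anything (older drivers may not have this subcommand):

nvidia-smi topo -m

GPU pairs reported as PIX or PXB share a PCI-E root complex and can typically do peer to peer; SYS means the traffic has to cross the QPI link between sockets, which is what you have at the moment.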

With regards to compilation, try compiling on the login node - that should have $CUDA_HOME defined, which is how the makefile knows where to find the nvcc compiler. Alternatively, you may need to load the correct module for the CUDA compilers, or point CUDA_HOME at /usr/local/cuda; it depends on how your system was set up. Either way, your sysadmin should know how to compile check_p2p and can try it on your nodes.
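
For example, something along these lines usually works on a modules-based cluster (the module name, path and binary name below are guesses based on your make output; 'module avail' will show what is actually installed):

module avail cuda                  # see which CUDA modules exist
module load cuda                   # or the specific version listed
export CUDA_HOME=${CUDA_HOME:-/usr/local/cuda}
export PATH=$CUDA_HOME/bin:$PATH
cd /path/to/check_p2p              # wherever you unpacked it
make
./gpuP2PCheck                      # run this step on a GPU node, not the login node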

All the best
Ross

> On Jun 3, 2015, at 11:58 AM, Victor Ma <victordsmagift.gmail.com> wrote:
>
> Hello Ross,
>
> Thank you so much for the detailed explanation. I think I know what the
> problem is. My command to run two GPUs on a single node is right:
> export CUDA_VISIBLE_DEVICES=0,1
> mpirun -np 2 pmemd.cuda.MPI -O ...
>
> When I run make for check_p2p, the error message is:
> /bin/nvcc -ccbin g++
> -I/home/rcf-proj2/zz1/zhen009/membrane/amber/prep/openmpi-1/check_p2p
> -m64 -o gpuP2PCheck.o -c gpuP2PCheck.cu
> make: /bin/nvcc: Command not found
> make: *** [gpuP2PCheck.o] Error 127
>
> I suppose nvcc is indeed not installed on the cluster, or at least not at
> /bin/nvcc.
>
> And your guess is right: the two GPUs are on two different buses:
> lspci -t -v
> -+-[0000:20]-+-00.0-[31]--
> | +-01.0-[21]--
> | +-01.1-[2a]--
> | +-02.0-[24]----00.0 NVIDIA Corporation GK110GL [Tesla K20m]
> | +-02.1-[2b]--
> | +-02.2-[2c]--
> | +-02.3-[2d]--
> | +-03.0-[27]--
> | +-03.1-[2e]--
> | +-03.2-[2f]--
> | +-03.3-[30]--
> | +-04.0 Intel Corporation Xeon E5/Core i7 DMA Channel 0
> | +-04.1 Intel Corporation Xeon E5/Core i7 DMA Channel 1
> | +-04.2 Intel Corporation Xeon E5/Core i7 DMA Channel 2
> | +-04.3 Intel Corporation Xeon E5/Core i7 DMA Channel 3
> | +-04.4 Intel Corporation Xeon E5/Core i7 DMA Channel 4
> | +-04.5 Intel Corporation Xeon E5/Core i7 DMA Channel 5
> | +-04.6 Intel Corporation Xeon E5/Core i7 DMA Channel 6
> | +-04.7 Intel Corporation Xeon E5/Core i7 DMA Channel 7
> | +-05.0 Intel Corporation Xeon E5/Core i7 Address Map,
> VTd_Misc, System Management
> | +-05.2 Intel Corporation Xeon E5/Core i7 Control Status and
> Global Errors
> | \-05.4 Intel Corporation Xeon E5/Core i7 I/O APIC
> \-[0000:00]-+-00.0 Intel Corporation Xeon E5/Core i7 DMI2
> +-01.0-[05]----00.0 LSI Logic / Symbios Logic SAS2308
> PCI-Express Fusion-MPT SAS-2
> +-01.1-[06]--
> +-02.0-[08]----00.0 NVIDIA Corporation GK110GL [Tesla K20m]
> +-02.1-[0c]--
> +-02.2-[0b]--
> +-02.3-[0d]--
> +-03.0-[07]----00.0 Mellanox Technologies MT27500 Family
> [ConnectX-3]
> +-03.1-[0e]--
> +-03.2-[0f]--
> +-03.3-[10]--
> +-04.0 Intel Corporation Xeon E5/Core i7 DMA Channel 0
> +-04.1 Intel Corporation Xeon E5/Core i7 DMA Channel 1
> +-04.2 Intel Corporation Xeon E5/Core i7 DMA Channel 2
> +-04.3 Intel Corporation Xeon E5/Core i7 DMA Channel 3
> +-04.4 Intel Corporation Xeon E5/Core i7 DMA Channel 4
> +-04.5 Intel Corporation Xeon E5/Core i7 DMA Channel 5
> +-04.6 Intel Corporation Xeon E5/Core i7 DMA Channel 6
> +-04.7 Intel Corporation Xeon E5/Core i7 DMA Channel 7
> +-05.0 Intel Corporation Xeon E5/Core i7 Address Map,
> VTd_Misc, System Management
> +-05.2 Intel Corporation Xeon E5/Core i7 Control Status and
> Global Errors
> +-05.4 Intel Corporation Xeon E5/Core i7 I/O APIC
> +-11.0-[04]--
> +-1a.0 Intel Corporation C600/X79 series chipset USB2
> Enhanced Host Controller #2
> +-1c.0-[02]--+-00.0 Intel Corporation I350 Gigabit Network
> Connection
> | \-00.1 Intel Corporation I350 Gigabit Network
> Connection
> +-1c.7-[01]--+-00.0 Hewlett-Packard Company Integrated
> Lights-Out Standard Slave Instrumentation & System Support
> | +-00.1 Matrox Electronics Systems Ltd. MGA G200EH
> | +-00.2 Hewlett-Packard Company Integrated
> Lights-Out Standard Management Processor Support and Messaging
> | \-00.4 Hewlett-Packard Company Integrated
> Lights-Out Standard Virtual USB Controller
> +-1d.0 Intel Corporation C600/X79 series chipset USB2
> Enhanced Host Controller #1
> +-1e.0-[03]--
> +-1f.0 Intel Corporation C600/X79 series chipset LPC
> Controller
> \-1f.2 Intel Corporation C600/X79 series chipset 6-Port SATA
> AHCI Controller
>
> I will let the system admin know and hope they might do something. :(
>
> Thanks again; I really appreciate it.
>
> Victor
>
>
> On Wed, Jun 3, 2015 at 11:31 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
>
>> Hi Victor,
>>
>> Do not attempt to run regular GPU MD runs across multiple nodes.
>> InfiniBand is far too slow these days to keep up with the computation speed
>> of the GPUs. The only types of simulation you can run over multiple
>> nodes with GPUs are loosely coupled runs, such as those based on replica
>> exchange approaches.
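>>
>> For completeness, if you ever do want to run replica exchange across nodes,
>> the launch goes through the multi-group mechanism, roughly like the sketch
>> below (the replica count and file names are just placeholders, and you will
>> need to add the REMD-specific flags for your exchange type from the AMBER 14
>> manual):
>>
>> mpirun -np 8 $AMBERHOME/bin/pmemd.cuda.MPI -ng 8 -groupfile remd.groupfile
>> # remd.groupfile contains one command line per replica, e.g.:
>> # -O -i mdin.rep.001 -o mdout.rep.001 -p prmtop -c inpcrd.rep.001 -r restrt.rep.001
>>
>> Each replica then drives a single GPU, so the inter-node traffic is only the
>> occasional exchange attempt, which InfiniBand copes with fine.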
>>
>> In terms of using more than one GPU within a node for a single MD run, it
>> is crucial that the GPUs can communicate via peer to peer over the PCI-E bus.
>> Having to go through the CPU chipset (which is what happens when they can't
>> talk via peer to peer) is also too slow these days. In terms of CPU counts
>> for multi-GPU runs, the CPU is used purely to control the GPU, so
>> running with -np 16 does not help - it actually launches 16 GPU 'instances',
>> which end up as 8 on each of your GPUs and really slow things down. We
>> could have taken the NAMD / GROMACS approach of offloading only part of the
>> calculation to the GPU and using the CPUs for the remainder, but the net
>> result is that you end up slower overall than just taking the
>> 'everything on the GPU' approach and leaving the excess CPUs idle. That
>> said, you can use those CPUs for other jobs. E.g.:
>>
>> export CUDA_VISIBLE_DEVICES=0
>> nohup $AMBERHOME/bin/pmemd.cuda -O -i mdin.0 -o mdout.0 ... &
>> export CUDA_VISIBLE_DEVICES=1
>> nohup $AMBERHOME/bin/pmemd.cuda -O -i mdin.1 -o mdout.1 ... &
>> nohup mpirun -np 14 $AMBERHOME/bin/pmemd.MPI -O -i mdin.2 -o mdout.2 ... &
>>
>> So the CPUs are not entirely wasted - although this takes a carefully
>> crafted scheduler on a cluster.
>>
>> In terms of using the two GPUs at the same time, the correct command line
>> for your 2-GPU case is:
>>
>> export CUDA_VISIBLE_DEVICES=0,1
>> mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O ...
>>
>> The issue is that without P2P it is impossible to get a speedup (for non-GB
>> calculations) over multiple GPUs. In that case the best you can do is run
>> two independent single-GPU jobs, one on each GPU, as above.
>>
>> Is there a reason you cannot build the check_p2p code? It's really simple;
>> I'd be shocked if the cluster did not have make and nvcc installed. How
>> else would anyone compile code for it, and how did they compile AMBER 14?
>>
>> One thing you can quickly try is running 'lspci | grep NVIDIA' on one of the
>> nodes, e.g.:
>>
>> [root.GTX_TD ~]# lspci | grep NVIDIA
>> 02:00.0 VGA compatible controller: NVIDIA Corporation GM204 (rev a1)
>> 02:00.1 Audio device: NVIDIA Corporation Device 0fbb (rev a1)
>> 03:00.0 VGA compatible controller: NVIDIA Corporation GM204 (rev a1)
>> 03:00.1 Audio device: NVIDIA Corporation Device 0fbb (rev a1)
>> 82:00.0 VGA compatible controller: NVIDIA Corporation GM204 (rev a1)
>> 82:00.1 Audio device: NVIDIA Corporation Device 0fbb (rev a1)
>> 83:00.0 VGA compatible controller: NVIDIA Corporation GM204 (rev a1)
>> 83:00.1 Audio device: NVIDIA Corporation Device 0fbb (rev a1)
>>
>> Here you get the bus numbers that the GPUs are connected to. In this case
>> there are 4 GPUs: one on bus 02, one on bus 03, one on bus 82 and one on
>> bus 83. You can then run 'lspci -t -v' to get a full bus connectivity
>> listing. In this case (pulling out the bits relevant to the GPUs) we have:
>>
>> +-[0000:80]-+-00.0-[81]--+-00.0 Intel Corporation I350 Gigabit Network
>> Connection
>> | | \-00.1 Intel Corporation I350 Gigabit Network
>> Connection
>> | +-02.0-[82]--+-00.0 NVIDIA Corporation GM204
>> | | \-00.1 NVIDIA Corporation Device 0fbb
>> | +-03.0-[83]--+-00.0 NVIDIA Corporation GM204
>> | | \-00.1 NVIDIA Corporation Device 0fbb
>> | +-04.0 Intel Corporation Xeon E5 v3/Core i7 DMA Channel 0
>>
>> and
>>
>> \-[0000:00]-+-00.0 Intel Corporation Xeon E5 v3/Core i7 DMI2
>> +-01.0-[01]--
>> +-02.0-[02]--+-00.0 NVIDIA Corporation GM204
>> | \-00.1 NVIDIA Corporation Device 0fbb
>> +-03.0-[03]--+-00.0 NVIDIA Corporation GM204
>> | \-00.1 NVIDIA Corporation Device 0fbb
>> +-04.0 Intel Corporation Xeon E5 v3/Core i7 DMA Channel 0
>>
>>
>> So you can see that the 4 GPUs are in two groups: one pair on one
>> bus (connected to one CPU socket) and the other pair on the other bus,
>> connected to the other CPU socket. GPUs can only communicate
>> via P2P if they are on the same PCI-E bus. So here GPUs 0 and 1 can do P2P,
>> and 2 and 3 can do P2P, but the combinations 0-2, 0-3, 1-2 and 1-3 are not
>> supported.
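>>
>> If you want to map the CUDA device numbers to those PCI bus IDs, recent
>> versions of nvidia-smi can print them directly (assuming the driver on your
>> nodes supports the query option):
>>
>> nvidia-smi --query-gpu=index,pci.bus_id --format=csv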
>>
>> In the case of your system I suspect that they placed one GPU on one bus
>> and one GPU on the other bus - this is about the worst combination you can
>> make for having two GPUs in the same node. If this is the case then you
>> need to ask the administrators to please physically move one of the GPUs to
>> a different PCI-E slot such that they are both connected to the same
>> physical CPU socket.
>>
>> Confusing and annoying but unfortunately a complexity that most people
>> building clusters these days don't consider.
>>
>> Hope that helps.
>>
>> All the best
>> Ross
>>
>>> On Jun 3, 2015, at 10:54 AM, Victor Ma <victordsmagift.gmail.com> wrote:
>>>
>>> Hello Amber community,
>>>
>>> I am testing AMBER 14 on a GPU cluster with InfiniBand. I noticed that when I
>>> turn on OpenMPI with pmemd.cuda.MPI, it can actually slow things down.
>>> On a single node, I have two GPUs and 16 CPUs. If I submit a job using
>>> "pmemd.cuda.MPI -O -i .....", one GPU is 99% used and P2P support is on.
>>> For my big system, I am getting ~27 ns/day. If I turn on OpenMPI and use
>>> this instead, "export CUDA_VISIBLE_DEVICES=0,1 then mpirun -np 2
>>> pmemd.cuda.MPI -O -i ....", two GPUs are 77% used each but P2P is OFF. In
>>> this case I am getting 33 ns/day. That is faster, but I suspect it could
>>> be even faster if P2P were on. The other thing I tried is to run "mpirun
>>> -np 16 pmemd.cuda.MPI -O -i ....". Here the run slows down to 14 ns/day:
>>> one GPU is used and all 16 CPUs are used. Again P2P is off.
>>>
>>> I downloaded the check_p2p scripts, but as I am working on a cluster I
>>> could not run "make".
>>>
>>> I am pretty happy with the speed I am getting, but I am also wondering if the
>>> configuration can be further optimized to improve performance, e.g. running
>>> on two GPUs at 100% with P2P on.
>>>
>>>
>>> Thank you!
>>>
>>>
>>> Victor


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Jun 03 2015 - 12:30:02 PDT