Yeah - that's a sucky way of doing it. :-(
Going through host memory is an order of magnitude too slow. Intel have done their best to prevent efficient communication between GPUs on different PCI-E buses. E.g. it should be possible to do P2P efficiently over the QPI link but it simply doesn't work. (I suspect by design).
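If you want to double-check the topology yourself without compiling check_p2p, something along these lines should do it on a compute node (treat it as a sketch - nvidia-smi topo -m needs a reasonably recent driver):

lspci | grep NVIDIA     # bus IDs of the two K20m cards; in your listing one is on bus 08, one on bus 24
nvidia-smi topo -m      # topology matrix; a SYS/SOC entry between the two GPUs means the
                        # path crosses the socket interconnect (QPI), i.e. no P2P possible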
OpenMP won't help you - it doesn't magically change the laws of physics. (And note that what you are running with mpirun and pmemd.cuda.MPI is MPI, not OpenMP.)
The best way to use nodes designed like this is to run 2 separate jobs per node. If the queuing system forces you to take a whole node to yourself then you would do something like:
cd run1
export CUDA_VISIBLE_DEVICES=0
$AMBERHOME/bin/pmemd.cuda -O -i ... &
cd ../run2
export CUDA_VISIBLE_DEVICES=1
$AMBERHOME/bin/pmemd.cuda -O -i ... &
wait
The wait stops your run script from terminating until all the background jobs have completed. If your two runs take roughly the same wallclock time, this will load balance fairly well.
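For completeness, here is a rough sketch of what that looks like wrapped in a batch script, assuming a PBS/Torque-style scheduler (the #PBS line and the run1/run2 directories are placeholders - adapt them to your site and your actual input files):

#!/bin/bash
#PBS -l nodes=1:ppn=16            # whole node; use your site's resource syntax here
cd $PBS_O_WORKDIR

cd run1
export CUDA_VISIBLE_DEVICES=0     # first copy pinned to GPU 0
$AMBERHOME/bin/pmemd.cuda -O -i ... &

cd ../run2
export CUDA_VISIBLE_DEVICES=1     # second copy pinned to GPU 1
$AMBERHOME/bin/pmemd.cuda -O -i ... &

wait                              # block until both background runs finish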
If the GPUs are set to exclusive-process compute mode (nvidia-smi -c 3) then you don't need the explicit CUDA_VISIBLE_DEVICES setting here - each pmemd.cuda instance will grab a free GPU. Some queueing systems (e.g. a properly configured SGE) let you set the compute mode at job submission time. Otherwise you are stuck with whatever the system policy is.
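To illustrate the exclusive-mode route (the nvidia-smi line needs root and is normally done by the admins at node boot, so this is only a sketch):

nvidia-smi -c 3                          # EXCLUSIVE_PROCESS compute mode, admin/root only

cd run1
$AMBERHOME/bin/pmemd.cuda -O -i ... &    # grabs the first free GPU
cd ../run2
$AMBERHOME/bin/pmemd.cuda -O -i ... &    # exclusive mode pushes this one onto the other GPU
wait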
The alternative (and better) option is a GPU-aware queuing system that treats GPUs as individual resources. In that case you request a single GPU; the queuing system allocates you a free one (possibly on a node shared with someone else) and sets CUDA_VISIBLE_DEVICES for you. Two jobs each requesting 1 GPU would then run on each node with CUDA_VISIBLE_DEVICES set appropriately. Again though, this is a local policy decision.
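Purely as an illustration (your cluster may well not run Slurm, and the GRES setup is a local configuration matter), a GPU-aware submission might look like:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1               # ask the scheduler for one GPU
# With GPU GRES configured, Slurm sets CUDA_VISIBLE_DEVICES for the job,
# so pmemd.cuda only ever sees the GPU it was allocated.
$AMBERHOME/bin/pmemd.cuda -O -i ...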
Hope that helps. It looks like you are stuck with single-GPU runs (2 jobs per node), I'm afraid.
All the best
Ross
> On Jun 3, 2015, at 1:29 PM, Victor Ma <victordsmagift.gmail.com> wrote:
>
> Hello Ross,
>
> I just heard back from the sysadmin:
>
> "You are correct in that the 2 GPUs are connected to different PCI-E
> nodes. This is a deliberate decision in that it balances aggregate I/O
> capability between the GPUs and the x86 processors, as there are 2 CPUs.
> As you have observed, this has the side-effect of disallowing GPUDirect
> communication.
>
> It is still possible to use 2 GPUs but your program must manage memory more
> explicitly. Transfers would then come down into host memory, across the
> inter-processor QPI channels, and into the second GPU. One possibility is
> to use multiple OpenMP threads or MPI ranks, one for each GPU; designing
> for the latter would allow your program to scale up to many nodes and GPUs."
>
> I thought we were using OpenMP, as I am running:
> mpirun -np 2 pmemd.cuda.MPI -O ...
>
> Anyway thank you so much!
>
> Victor
>
>
>
> On Wed, Jun 3, 2015 at 12:29 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
>
>> Hi Victor,
>>
>> Indeed, the two different buses are the problem. Hopefully you can convince
>> the sysadmin to physically move one of the K20s to a different slot. This
>> may or may not be possible depending on the design of the underlying nodes.
>> I note they are K20m cards, which are passively cooled; that likely means a
>> ducted node designed more with space (e.g. 1U or 1/2U) and proprietary
>> motherboards and cases in mind than actual performance. But you might be
>> lucky and these are indeed 2U / 4-GPU-capable nodes that were just not
>> maxed out on GPUs, so there is space to move one of the GPUs to the other
>> bus - although these are HP nodes so I wouldn't bet on it. :-(
>>
>> With regards to compilation - try compiling on the login node - this
>> should have $CUDA_HOME defined which is how the makefile knows where to
>> find the nvcc compiler. Alternatively you may need to load the correct
>> module for the cuda compilers. Or point CUDA_HOME to /usr/local/cuda - it
>> depends on how your system was set up. Either way your sysadmin should know
>> how to compile check_p2p and can try it on your nodes.
>>
>> All the best
>> Ross
>>
>>> On Jun 3, 2015, at 11:58 AM, Victor Ma <victordsmagift.gmail.com> wrote:
>>>
>>> Hello Ross,
>>>
>>> Thank you so much for the detailed explanation. I think I know what the
>>> problem is. My command to run 2 gpus on a single node is right:
>>> export CUDA_VISIBLE_DEVICES=0,1
>>> mpirun -np 2 pmemd.cuda.MPI -O ...
>>>
>>> When I run make for check_p2p, the error message is:
>>> /bin/nvcc -ccbin g++
>>> -I/home/rcf-proj2/zz1/zhen009/membrane/amber/prep/openmpi-1/check_p2p
>>> -m64 -o gpuP2PCheck.o -c gpuP2PCheck.cu
>>> make: /bin/nvcc: Command not found
>>> make: *** [gpuP2PCheck.o] Error 127
>>>
>>> I suppose nvcc is indeed not installed on the cluster, or at least not
>>> under /bin/nvcc.
>>>
>>> And your guess is right: the two gpus are on two different buses:
>>> lspci -t -v
>>> -+-[0000:20]-+-00.0-[31]--
>>> | +-01.0-[21]--
>>> | +-01.1-[2a]--
>>> | +-02.0-[24]----00.0 NVIDIA Corporation GK110GL [Tesla K20m]
>>> | +-02.1-[2b]--
>>> | +-02.2-[2c]--
>>> | +-02.3-[2d]--
>>> | +-03.0-[27]--
>>> | +-03.1-[2e]--
>>> | +-03.2-[2f]--
>>> | +-03.3-[30]--
>>> | +-04.0 Intel Corporation Xeon E5/Core i7 DMA Channel 0
>>> | +-04.1 Intel Corporation Xeon E5/Core i7 DMA Channel 1
>>> | +-04.2 Intel Corporation Xeon E5/Core i7 DMA Channel 2
>>> | +-04.3 Intel Corporation Xeon E5/Core i7 DMA Channel 3
>>> | +-04.4 Intel Corporation Xeon E5/Core i7 DMA Channel 4
>>> | +-04.5 Intel Corporation Xeon E5/Core i7 DMA Channel 5
>>> | +-04.6 Intel Corporation Xeon E5/Core i7 DMA Channel 6
>>> | +-04.7 Intel Corporation Xeon E5/Core i7 DMA Channel 7
>>> | +-05.0 Intel Corporation Xeon E5/Core i7 Address Map,
>>> VTd_Misc, System Management
>>> | +-05.2 Intel Corporation Xeon E5/Core i7 Control Status and
>>> Global Errors
>>> | \-05.4 Intel Corporation Xeon E5/Core i7 I/O APIC
>>> \-[0000:00]-+-00.0 Intel Corporation Xeon E5/Core i7 DMI2
>>> +-01.0-[05]----00.0 LSI Logic / Symbios Logic SAS2308
>>> PCI-Express Fusion-MPT SAS-2
>>> +-01.1-[06]--
>>> +-02.0-[08]----00.0 NVIDIA Corporation GK110GL [Tesla K20m]
>>> +-02.1-[0c]--
>>> +-02.2-[0b]--
>>> +-02.3-[0d]--
>>> +-03.0-[07]----00.0 Mellanox Technologies MT27500 Family
>>> [ConnectX-3]
>>> +-03.1-[0e]--
>>> +-03.2-[0f]--
>>> +-03.3-[10]--
>>> +-04.0 Intel Corporation Xeon E5/Core i7 DMA Channel 0
>>> +-04.1 Intel Corporation Xeon E5/Core i7 DMA Channel 1
>>> +-04.2 Intel Corporation Xeon E5/Core i7 DMA Channel 2
>>> +-04.3 Intel Corporation Xeon E5/Core i7 DMA Channel 3
>>> +-04.4 Intel Corporation Xeon E5/Core i7 DMA Channel 4
>>> +-04.5 Intel Corporation Xeon E5/Core i7 DMA Channel 5
>>> +-04.6 Intel Corporation Xeon E5/Core i7 DMA Channel 6
>>> +-04.7 Intel Corporation Xeon E5/Core i7 DMA Channel 7
>>> +-05.0 Intel Corporation Xeon E5/Core i7 Address Map,
>>> VTd_Misc, System Management
>>> +-05.2 Intel Corporation Xeon E5/Core i7 Control Status and
>>> Global Errors
>>> +-05.4 Intel Corporation Xeon E5/Core i7 I/O APIC
>>> +-11.0-[04]--
>>> +-1a.0 Intel Corporation C600/X79 series chipset USB2
>>> Enhanced Host Controller #2
>>> +-1c.0-[02]--+-00.0 Intel Corporation I350 Gigabit Network
>>> Connection
>>> | \-00.1 Intel Corporation I350 Gigabit Network
>>> Connection
>>> +-1c.7-[01]--+-00.0 Hewlett-Packard Company Integrated
>>> Lights-Out Standard Slave Instrumentation & System Support
>>> | +-00.1 Matrox Electronics Systems Ltd. MGA
>> G200EH
>>> | +-00.2 Hewlett-Packard Company Integrated
>>> Lights-Out Standard Management Processor Support and Messaging
>>> | \-00.4 Hewlett-Packard Company Integrated
>>> Lights-Out Standard Virtual USB Controller
>>> +-1d.0 Intel Corporation C600/X79 series chipset USB2
>>> Enhanced Host Controller #1
>>> +-1e.0-[03]--
>>> +-1f.0 Intel Corporation C600/X79 series chipset LPC
>>> Controller
>>> \-1f.2 Intel Corporation C600/X79 series chipset 6-Port SATA
>>> AHCI Controller
>>>
>>> I will let the system admin know and hope they might do something. :(
>>>
>>> Thanks again and really appreciate it.
>>>
>>> Victor
>>>
>>>
>>> On Wed, Jun 3, 2015 at 11:31 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
>>>
>>>> Hi Victor,
>>>>
>>>> Do not attempt to run regular GPU MD runs across multiple nodes.
>>>> Infiniband is way too slow these days to keep up with the computation
>>>> speed of the GPUs. The only simulations you can run over multiple nodes
>>>> with GPUs are loosely coupled runs, such as those based on replica
>>>> exchange approaches.
>>>>
>>>> In terms of using more than one GPU within a node for a single MD run,
>>>> it is crucial that they can communicate via peer to peer over the PCI-E
>>>> bus. Having to go through the CPU chipset (which is what happens when
>>>> they can't talk via peer to peer) is also too slow these days. In terms
>>>> of CPU counts for multi-GPU runs: the CPU is used purely to control the
>>>> GPU, so running with -np 16 does not help - it actually runs 16 GPU
>>>> 'instances', which end up as 8 on each of your GPUs and really slows
>>>> things down. We could have taken the NAMD / Gromacs approach of
>>>> offloading only part of the calculation to the GPU and using the CPUs
>>>> for the remainder, but the net result is that you end up slower overall
>>>> than just taking the 'everything on the GPU' approach and leaving the
>>>> excess CPUs idle. That said, you can use those CPUs for other jobs. E.g.
>>>>
>>>> export CUDA_VISIBLE_DEVICES=0
>>>> nohup $AMBERHOME/bin/pmemd.cuda -O -i mdin.0 -o mdout.0 ... &
>>>> export CUDA_VISIBLE_DEVICES=1
>>>> nohup $AMBERHOME/bin/pmemd.cuda -O -i mdin.1 -o mdout.1 ... &
>>>> nohup mpirun -np 14 $AMBERHOME/bin/pmemd.MPI -O -i mdin.2 -o mdout.2 ... &
>>>>
>>>> So the CPUs are not entirely wasted - although this takes a carefully
>>>> crafted scheduler on a cluster.
>>>>
>>>> In terms of using the 2 GPUs at the same time, the correct command line
>>>> for your 2 GPU case is:
>>>>
>>>> export CUDA_VISIBLE_DEVICES=0,1
>>>> mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O ...
>>>>
>>>> The issue is that without P2P it is impossible to get a speedup (for
>>>> non-GB calculations) over multiple GPUs. In this case the best you can
>>>> do is run two single-GPU jobs, one on each GPU, as above.
>>>>
>>>> Is there a reason you cannot build the check_p2p code? It's real simple -
>>>> I'd be shocked if the cluster did not have make and nvcc installed. How
>>>> would anyone compile their code for it? How did they compile AMBER 14?
>>>>
>>>> One thing you can quickly try is running lspci | grep NVIDIA on one of
>>>> the nodes. E.g.
>>>>
>>>> [root.GTX_TD ~]# lspci | grep NVIDIA
>>>> 02:00.0 VGA compatible controller: NVIDIA Corporation GM204 (rev a1)
>>>> 02:00.1 Audio device: NVIDIA Corporation Device 0fbb (rev a1)
>>>> 03:00.0 VGA compatible controller: NVIDIA Corporation GM204 (rev a1)
>>>> 03:00.1 Audio device: NVIDIA Corporation Device 0fbb (rev a1)
>>>> 82:00.0 VGA compatible controller: NVIDIA Corporation GM204 (rev a1)
>>>> 82:00.1 Audio device: NVIDIA Corporation Device 0fbb (rev a1)
>>>> 83:00.0 VGA compatible controller: NVIDIA Corporation GM204 (rev a1)
>>>> 83:00.1 Audio device: NVIDIA Corporation Device 0fbb (rev a1)
>>>>
>>>> Here you get the bus numbers that the GPUs are connected to. In this
>>>> case there are 4 GPUs: one on bus 02, one on bus 03, one on bus 82 and
>>>> one on bus 83. You can then run 'lspci -t -v' to get a full bus
>>>> connectivity listing. In this case (pulling out the bits relevant to the
>>>> GPUs) we have:
>>>>
>>>> +-[0000:80]-+-00.0-[81]--+-00.0 Intel Corporation I350 Gigabit Network
>>>> Connection
>>>> | | \-00.1 Intel Corporation I350 Gigabit Network
>>>> Connection
>>>> | +-02.0-[82]--+-00.0 NVIDIA Corporation GM204
>>>> | | \-00.1 NVIDIA Corporation Device 0fbb
>>>> | +-03.0-[83]--+-00.0 NVIDIA Corporation GM204
>>>> | | \-00.1 NVIDIA Corporation Device 0fbb
>>>> | +-04.0 Intel Corporation Xeon E5 v3/Core i7 DMA Channel 0
>>>>
>>>> and
>>>>
>>>> \-[0000:00]-+-00.0 Intel Corporation Xeon E5 v3/Core i7 DMI2
>>>> +-01.0-[01]--
>>>> +-02.0-[02]--+-00.0 NVIDIA Corporation GM204
>>>> | \-00.1 NVIDIA Corporation Device 0fbb
>>>> +-03.0-[03]--+-00.0 NVIDIA Corporation GM204
>>>> | \-00.1 NVIDIA Corporation Device 0fbb
>>>> +-04.0 Intel Corporation Xeon E5 v3/Core i7 DMA Channel 0
>>>>
>>>>
>>>> So you see here that the 4 GPUs are in two groups: one set of two on one
>>>> bus (connected to one of the CPU sockets) and the other set of two on
>>>> the other bus, connected to the other CPU socket. GPUs here can only
>>>> communicate via P2P if they are on the same PCI-E bus. So here GPUs 0
>>>> and 1 can do P2P, and 2 and 3 can do P2P, but the combinations 0-2, 0-3,
>>>> 1-2 and 1-3 are not supported.
>>>>
>>>> In the case of your system I suspect that they placed one GPU on one bus
>>>> and one GPU on the other bus - this is about the worst combination you
>>>> can make for having two GPUs in the same node. If this is the case then
>>>> you need to ask the administrators to please physically move one of the
>>>> GPUs to a different PCI-E slot so that they are both connected to the
>>>> same physical CPU socket.
>>>>
>>>> Confusing and annoying but unfortunately a complexity that most people
>>>> building clusters these days don't consider.
>>>>
>>>> Hope that helps.
>>>>
>>>> All the best
>>>> Ross
>>>>
>>>>> On Jun 3, 2015, at 10:54 AM, Victor Ma <victordsmagift.gmail.com> wrote:
>>>>>
>>>>> Hello Amber community,
>>>>>
>>>>> I am testing my amber14 on a gpu cluster with IB. I noticed that when I
>>>>> turn on openmpi with pmemd.cuda.MPI, it actually slows things down.
>>>>> On a single node, I have two gpus and 16 cpus. If I submit a job using
>>>>> "pmemd.cuda.MPI -O -i .....", one gpu is 99% used and P2P support is on.
>>>>> For my big system, I am getting ~27 ns/day. If I turn on openmpi and use
>>>>> this instead, "export CUDA_VISIBLE_DEVICES=0,1 then mpirun -np 2
>>>>> pmemd.cuda.MPI -O -i ....", two gpus are 77% used each but P2P is OFF.
>>>>> In this case, I am getting 33 ns/day. It is faster, but I suspect that
>>>>> it could be even faster if P2P were on. The other thing I tried is to
>>>>> run "mpirun -np 16 pmemd.cuda.MPI -O -i ....". Here the run slows down
>>>>> to 14 ns/day. One GPU is used and all 16 cpus are used. Again P2P is
>>>>> off.
>>>>>
>>>>> I downloaded the check_p2p scripts, but as I am working on a cluster I
>>>>> could not run "make".
>>>>>
>>>>> I am pretty happy with the speed I am getting, but I am also wondering
>>>>> if the configuration can be further optimized to improve performance,
>>>>> e.g. running on 2 GPUs at 100% with P2P on.
>>>>>
>>>>>
>>>>> Thank you!
>>>>>
>>>>>
>>>>> Victor
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber