Re: [AMBER] Amber16 on K80 GPUs --poor performance on multiple GPUs

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 4 Jan 2017 08:05:08 -0500

Hi Susan,

Please see the following:

http://ambermd.org/gpus/#Max_Perf

particularly item 10 with regard to OpenMPI. I've reproduced it here for convenience, but I'd recommend reading the whole page.


10. If performance is poor when you run multiple multi-GPU jobs at once - say you run 2 x 2-GPU jobs and they do not both run at the full speed they would reach if the other job were not running - make sure you turn off thread affinity within your MPI implementation, or at least bind each MPI rank to a different core. In my experience MPICH does not enable affinity by default, so no special settings are needed, but both MVAPICH and OpenMPI set thread affinity by default. That would actually be useful if they did it intelligently; instead they pay no attention to load, or even to other MVAPICH or OpenMPI runs, and always assign cores starting from core 0. So 2 x 2-GPU jobs are, rather unhelpfully, both pinned to cores 0 and 1. The simplest solution is to disable thread affinity as follows (a sketch of running two jobs side by side appears after the commands):

MVAPICH: export MV2_ENABLE_AFFINITY=0; mpirun -np 2 ...
OpenMPI: mpirun --bind-to none -np 2 ...
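For example, here is a minimal sketch of two independent 2-GPU jobs sharing one 4-GPU node: each job gets its own pair of GPUs via CUDA_VISIBLE_DEVICES and is launched with OpenMPI binding disabled. The job1/job2 directories and the use of $AMBERHOME are just placeholders for illustration; adjust paths and input names to your own setup.

# first 2-GPU job, on GPUs 0 and 1, run from its own directory
cd job1
export CUDA_VISIBLE_DEVICES=0,1
nohup mpirun --bind-to none -np 2 $AMBERHOME/bin/pmemd.cuda_SPFP.MPI \
    -O -i mdin.GPU -o mdout -p prmtop -c inpcrd &

# second 2-GPU job, on GPUs 2 and 3
cd ../job2
export CUDA_VISIBLE_DEVICES=2,3
nohup mpirun --bind-to none -np 2 $AMBERHOME/bin/pmemd.cuda_SPFP.MPI \
    -O -i mdin.GPU -o mdout -p prmtop -c inpcrd &

The same pattern applies with MVAPICH, with MV2_ENABLE_AFFINITY=0 in place of --bind-to none.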

All the best
Ross


> On Jan 3, 2017, at 12:36, Susan Chacko <susanc.helix.nih.gov> wrote:
>
> According to mdout, peer-to-peer support is enabled.
>
> |------------------- GPU DEVICE INFO --------------------
> |
> | Task ID: 0
> | CUDA_VISIBLE_DEVICES: not set
> | CUDA Capable Devices Detected: 4
> | CUDA Device ID in use: 0
> | CUDA Device Name: Tesla K80
> | CUDA Device Global Mem Size: 11519 MB
> | CUDA Device Num Multiprocessors: 13
> | CUDA Device Core Freq: 0.82 GHz
> |
> |
> | Task ID: 1
> | CUDA_VISIBLE_DEVICES: not set
> | CUDA Capable Devices Detected: 4
> | CUDA Device ID in use: 1
> | CUDA Device Name: Tesla K80
> | CUDA Device Global Mem Size: 11519 MB
> | CUDA Device Num Multiprocessors: 13
> | CUDA Device Core Freq: 0.82 GHz
> |
> |--------------------------------------------------------
>
> |---------------- GPU PEER TO PEER INFO -----------------
> |
> | Peer to Peer support: ENABLED
>
>
> I also downloaded and ran the check_p2p program from the Amber site, and
> got:
>
> -----------
>
> % ./gpuP2PCheck
> CUDA_VISIBLE_DEVICES is unset.
> CUDA-capable device count: 4
> GPU0 " Tesla K80"
> GPU1 " Tesla K80"
> GPU2 " Tesla K80"
> GPU3 " Tesla K80"
>
> Two way peer access between:
> GPU0 and GPU1: YES
> GPU0 and GPU2: YES
> GPU0 and GPU3: YES
> GPU1 and GPU2: YES
> GPU1 and GPU3: YES
> GPU2 and GPU3: YES
>
> -----------
>
> So in theory I should be able to run on up to 4 GPUs.
> I'll try rebuilding with CUDA 8.0 next, as Huang Jing suggested, unless
> anyone else has other ideas.
>
> Susan.
>
>
> On 1/3/17 11:25 AM, Daniel Roe wrote:
>> Hi,
>>
>> See the 'Multi GPU' section in http://ambermd.org/gpus/#Running for
>> some tips. In particular, you need to make sure that the GPUs can use
>> direct peer-to-peer communication to get any kind of speedup on
>> multiple GPUs (this is reported near the top of the mdout output).
>>
>> -Dan
>>
>> On Tue, Jan 3, 2017 at 11:00 AM, Susan Chacko <susanc.helix.nih.gov> wrote:
>>> Hi all,
>>>
>>> I successfully built Amber 16 with Intel 2015.1.133, CUDA 7.5, and
>>> OpenMPI 2.0.1. We're running CentOS 6.8 and Nvidia driver 352.39 on
>>> K80x GPUs.
>>>
>>> I ran the benchmark suite. I'm getting approx the same results as shown
>>> on the Amber16 benchmark page for CPUs and 1 GPU
>>> (http://ambermd.org/gpus/benchmarks.htm)
>>>
>>> e.g.
>>>
>>> Factor IX NPT
>>>
>>> Intel E5-2695 v3 @ 2.30GHz, 28 cores: 9.58 ns/day
>>>
>>> 1 K80 GPU: 31.2 ns/day
>>>
>>> However, when I attempt to run on 2 K80 GPUs, performance drops
>>> dramatically.
>>> 2 K80 GPUs: 1.19 ns/day
>>>
>>> I'm running the pmemd.cuda_SPFP.MPI executable like this:
>>> cd Amber16_Benchmark_Suite/PME/FactorIX_production_NPT
>>> mpirun -np # /usr/local/apps/amber/amber16/bin/pmemd.cuda_SPFP.MPI -O -i
>>> mdin.GPU -o mdout -p prmtop -c inpcrd
>>> where # is 1 or 2.
>>> Each of the individual GPUs ran this benchmark at ~31.2 ns/day, so I
>>> don't think there is an intrinsic problem with any of the GPU hardware.
>>> I get the same drop in performance with pmemd.cuda_DPFP.MPI and
>>> pmemd.cuda_SPXP.MPI
>>>
>>> Is this expected behaviour? I don't see a benchmark for 2 or more K80s
>>> on the Amber16 GPU benchmarks page, so I'm not sure what to expect. I
>>> also see that the benchmarks on that page were run with Amber16 on
>>> CentOS 7 + CUDA 8.0 + MPICH 3.1.4, and on a later version of the Nvidia
>>> drivers than we have, but I would not expect those differences to
>>> account for what I'm seeing.
>>>
>>> Any ideas? Is it worth rebuilding with CUDA 8.0, or MPICH instead of
>>> OpenMPI?
>>>
>>> All thoughts and suggestions much appreciated,
>>> Susan.
>>>
>>>
>>
>>
>
> --
> Susan Chacko, Ph.D.
> HPC @ NIH Staff
>
>

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Jan 04 2017 - 05:30:03 PST