Re: [AMBER] Amber16 on K80 GPUs --poor performance on multiple GPUs

From: Susan Chacko <susanc.helix.nih.gov>
Date: Thu, 5 Jan 2017 11:41:00 -0500

I had read that and tried adding the --bind-to none flag, but it didn't
make a difference. I've also tried explicitly pinning the MPI ranks with
'taskset', selecting cores on the same socket as the GPUs.
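
(For reference, the taskset attempt was roughly of this form; the core list
here is illustrative, and the cores belonging to the GPU socket can be read
from lscpu or nvidia-smi topo -m:

taskset -c 14-27 mpirun --bind-to none -np 2 `which pmemd.cuda.MPI` \
    -O -i mdin.GPU -o mdout -p prmtop -c inpcrd
)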

There are no other jobs running on this K80 node: just my one benchmark
run.

I have a little additional information to add now. We have both K20xs
and K80s in our cluster. For the Factor IX NPT benchmark:

K20x, 1 GPU: 31.94 ns/day
K20x, 2 GPU: 30.97 ns/day

K80, 1 GPU: 30.26 ns/day
K80, 2 GPU: 1.13 ns/day

Thus, the K20x's behave much like they did in my Amber14 benchmarks; I only
see the big drop in performance with the K80s.

The CPU instruction sets on the K20x nodes differ from those on the K80
nodes:
K20x: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est
tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb
xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase smep
erms

K80: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est
tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb
xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase
bmi1 avx2 smep bmi2 erms invpcid cqm cqm_llc cqm_occup_llc
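
(These are the host CPU flags; they can be collected with something like

grep '^flags' /proc/cpuinfo | head -1

on one node of each type.)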

I had originally built on the K20x, so that the executable would run on
either type of GPU. I tried rebuilding on the K80, but even with this
K80-built executable, I'm still getting ~30 ns/day on 1 K80, and ~1
ns/day on 2 K80s.
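
(The rebuild followed the usual Amber16 GPU build sequence; this is a sketch
with our Intel toolchain assumed and site-specific paths omitted:

cd $AMBERHOME
./configure -cuda intel      && make install   # serial GPU build
./configure -cuda -mpi intel && make install   # parallel GPU build
)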

One thing I noticed is that nvidia-smi lists 4 process entries across the 2
K80 GPUs in use (each MPI rank's PID shows up on both GPUs), while only 2
entries appear for the 2 K20x GPUs, i.e.

K20x nvidia-smi:
+------------------------------------------------------+
| NVIDIA-SMI 352.39     Driver Version: 352.39         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20Xm         On   | 0000:08:00.0     Off |                  Off |
| N/A   33C    P0    97W / 235W |    275MiB /  6143MiB |     62%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20Xm         On   | 0000:27:00.0     Off |                  Off |
| N/A   35C    P0   103W / 235W |    333MiB /  6143MiB |     71%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     55048    C   ...cal/apps/amber/amber16/bin/pmemd.cuda.MPI   259MiB |
|    1     55049    C   ...cal/apps/amber/amber16/bin/pmemd.cuda.MPI   317MiB |
+-----------------------------------------------------------------------------+

K80 nvidia-smi:
+------------------------------------------------------+
| NVIDIA-SMI 352.39     Driver Version: 352.39         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:83:00.0     Off |                    0 |
| N/A   34C    P8    27W / 149W |     22MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:84:00.0     Off |                    0 |
| N/A   26C    P8    31W / 149W |     22MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:8A:00.0     Off |                    0 |
| N/A   75C    P0    73W / 149W |    349MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:8B:00.0     Off |                    0 |
| N/A   49C    P0    80W / 149W |    407MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    2     29646    C   ...cal/apps/amber/amber16/bin/pmemd.cuda.MPI   258MiB |
|    2     29647    C   ...cal/apps/amber/amber16/bin/pmemd.cuda.MPI    64MiB |
|    3     29646    C   ...cal/apps/amber/amber16/bin/pmemd.cuda.MPI    64MiB |
|    3     29647    C   ...cal/apps/amber/amber16/bin/pmemd.cuda.MPI   316MiB |
+-----------------------------------------------------------------------------+

Is this significant? I'm running the exact same command and executable
on both types of GPUs, i.e.

mpirun --bind-to none -np 2 `which pmemd.cuda.MPI` -O -i mdin.GPU -o mdout -p prmtop -c inpcrd

The CPU side looks fine, i.e. both the K20x and the K80 runs show 2 MPI
processes on the host.
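
(For reference, the two busy K80s could also be selected explicitly instead
of leaving CUDA_VISIBLE_DEVICES unset; a minimal variant of the command
above, with GPU IDs 2,3 simply matching the pair that was active in the
nvidia-smi output:

export CUDA_VISIBLE_DEVICES=2,3
mpirun --bind-to none -np 2 `which pmemd.cuda.MPI` -O -i mdin.GPU -o mdout -p prmtop -c inpcrd
)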

Susan

On 1/4/17 8:05 AM, Ross Walker wrote:
> Hi Susan,
>
> Please see the following:
>
> http://ambermd.org/gpus/#Max_Perf
>
> particularly item 10 with regard to OpenMPI. I've reproduced it here for convenience, but I'd recommend reading the whole page.
>
>
> 10. If you see bad performance when running multiple multi-GPU runs (say you run 2 x 2-GPU jobs and they don't both run at full speed, as they would if the other job were not running), then make sure you turn off thread affinity within your MPI implementation, or at least pin each MPI thread to a different core. In my experience MPICH does not have this on by default, so no special settings are needed; however, both MVAPICH and OpenMPI set thread affinity by default. This would actually be useful if they did it in an intelligent way. However, they seem to pay no attention to load, or even to other MVAPICH or OpenMPI runs, and always just assign from core 0. So 2 x 2-GPU jobs are, rather foolishly, assigned to cores 0 and 1 in both cases. The simplest solution here is to just disable thread affinity as follows:
>
> MVAPICH: export MV2_ENABLE_AFFINITY=0; mpirun -np 2 ...
> OpenMPI: mpirun --bind-to none -np 2 ...
>
> All the best
> Ross
>
>
>> On Jan 3, 2017, at 12:36, Susan Chacko <susanc.helix.nih.gov> wrote:
>>
>> According to mdout, peer-to-peer support is enabled.
>>
>> |------------------- GPU DEVICE INFO --------------------
>> |
>> | Task ID: 0
>> | CUDA_VISIBLE_DEVICES: not set
>> | CUDA Capable Devices Detected: 4
>> | CUDA Device ID in use: 0
>> | CUDA Device Name: Tesla K80
>> | CUDA Device Global Mem Size: 11519 MB
>> | CUDA Device Num Multiprocessors: 13
>> | CUDA Device Core Freq: 0.82 GHz
>> |
>> |
>> | Task ID: 1
>> | CUDA_VISIBLE_DEVICES: not set
>> | CUDA Capable Devices Detected: 4
>> | CUDA Device ID in use: 1
>> | CUDA Device Name: Tesla K80
>> | CUDA Device Global Mem Size: 11519 MB
>> | CUDA Device Num Multiprocessors: 13
>> | CUDA Device Core Freq: 0.82 GHz
>> |
>> |--------------------------------------------------------
>>
>> |---------------- GPU PEER TO PEER INFO -----------------
>> |
>> | Peer to Peer support: ENABLED
>>
>>
>> I also downloaded and ran the check_p2p program from the Amber site, and
>> got:
>>
>> -----------
>>
>> % ./gpuP2PCheck
>> CUDA_VISIBLE_DEVICES is unset.
>> CUDA-capable device count: 4
>> GPU0 " Tesla K80"
>> GPU1 " Tesla K80"
>> GPU2 " Tesla K80"
>> GPU3 " Tesla K80"
>>
>> Two way peer access between:
>> GPU0 and GPU1: YES
>> GPU0 and GPU2: YES
>> GPU0 and GPU3: YES
>> GPU1 and GPU2: YES
>> GPU1 and GPU3: YES
>> GPU2 and GPU3: YES
>>
>> -----------
>>
>> So in theory I should be able to run on up to 4 GPUs.
>> I'll try rebuilding with CUDA 8.0 next, as Huang Jing suggested, unless
>> anyone else has other ideas.
>>
>> Susan.
>>
>>
>> On 1/3/17 11:25 AM, Daniel Roe wrote:
>>> Hi,
>>>
>>> See the 'Multi GPU' section in http://ambermd.org/gpus/#Running for
>>> some tips. In particular you need to make sure that the GPUs can run
>>> with direct peer-to-peer communication to get any kind of speedup for
>>> multi GPUs (this is printed somewhere near the top of mdout output).
>>>
>>> -Dan
>>>
>>> On Tue, Jan 3, 2017 at 11:00 AM, Susan Chacko <susanc.helix.nih.gov> wrote:
>>>> Hi all,
>>>>
>>>> I successfully built Amber 16 with Intel 2015.1.133, CUDA 7.5, and
>>>> OpenMPI 2.0.1. We're running CentOS 6.8 and NVIDIA driver 352.39 on
>>>> K80 GPUs.
>>>>
>>>> I ran the benchmark suite. I'm getting approx the same results as shown
>>>> on the Amber16 benchmark page for CPUs and 1 GPU
>>>> (http://ambermd.org/gpus/benchmarks.htm)
>>>>
>>>> e.g.
>>>>
>>>> Factor IX NPT
>>>>
>>>> Intel E5-2695 v3 @ 2.30GHz, 28 cores: 9.58 ns/day
>>>>
>>>> 1 K80 GPU: 31.2 ns/day
>>>>
>>>> However, when I attempt to run on 2 K80 GPUs, performance drops
>>>> dramatically.
>>>> 2 K80 GPUs: 1.19 ns/day
>>>>
>>>> I'm running the pmemd.cuda_SPFP.MPI executable like this:
>>>> cd Amber16_Benchmark_Suite/PME/FactorIX_production_NPT
>>>> mpirun -np # /usr/local/apps/amber/amber16/bin/pmemd.cuda_SPFP.MPI -O -i
>>>> mdin.GPU -o mdout -p prmtop -c inpcrd
>>>> where # is 1 or 2.
>>>> Each of the individual GPUs ran this benchmark at ~31.2 ns/day, so I
>>>> don't think there is any intrinsic problem with any of the GPU hardware.
>>>> I get the same drop in performance with pmemd.cuda_DPFP.MPI and
>>>> pmemd.cuda_SPXP.MPI
>>>>
>>>> Is this expected behaviour? I don't see a benchmark for 2 or more K80s
>>>> on the Amber16 GPU benchmark page, so I'm not sure what to expect. I
>>>> also see that the benchmarks on that page were run with Amber16 on
>>>> CentOS 7 + CUDA 8.0 + MPICH 3.1.4 and with later NVIDIA drivers than
>>>> ours, but I would not expect those differences to account for what I'm
>>>> seeing.
>>>>
>>>> Any ideas? Is it worth rebuilding with CUDA 8.0, or MPICH instead of
>>>> OpenMPI?
>>>>
>>>> All thoughts and suggestions much appreciated,
>>>> Susan.
>>>>
>>>>
>>>
>> --
>> Susan Chacko, Ph.D.
>> HPC @ NIH Staff
>>
>>

-- 
Susan Chacko, Ph.D.
HPC @ NIH Staff
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jan 05 2017 - 09:00:03 PST