Re: [AMBER] Amber16 Parallel CUDA Tests

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 11 Aug 2016 20:17:16 -0700

Hi Steve,

I suspect your hardware is misconfigured. Could you run a couple of tests, please?

With CUDA_VISIBLE_DEVICES unset:

1) As root run: lspci -d "10b5:*" -vvv | grep ACSCtl

and post the output here.
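
As a side note, a small loop like the one below (just a sketch, not an official recipe) prints each PLX bridge's address next to its ACSCtl line, which makes it easier to spot a bridge with ACS source validation enabled (SrcValid+). ACS enabled on a bridge between the GPUs can force peer-to-peer traffic to be redirected up through the root complex, which is a common reason for P2P being reported as available but not actually working:

for bdf in $(lspci -d "10b5:*" | awk '{print $1}'); do
    echo "== bridge $bdf =="
    lspci -s "$bdf" -vvv | grep ACSCtl
done

If every ACSCtl field shows '-' (e.g. SrcValid-), ACS on the PLX bridges is not the culprit.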

2) Compile the CUDA samples installed as part of CUDA 7.5 and then run the following:

unset CUDA_VISIBLE_DEVICES
./simpleP2P

export CUDA_VISIBLE_DEVICES=0,1
./simpleP2P

export CUDA_VISIBLE_DEVICES=2,3
./simpleP2P

export CUDA_VISIBLE_DEVICES=0,2
./simpleP2P

And post the results here.
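
In case it saves some typing, here is a rough sketch of that whole sequence. The samples path is an assumption - the CUDA 7.5 installer puts a copy under ~/NVIDIA_CUDA-7.5_Samples by default - so adjust it to wherever you installed them:

# build the P2P sample (assumed default samples location)
cd ~/NVIDIA_CUDA-7.5_Samples/0_Simple/simpleP2P && make

# all visible devices
unset CUDA_VISIBLE_DEVICES
./simpleP2P

# selected device pairs
for pair in 0,1 2,3 0,2; do
    echo "== CUDA_VISIBLE_DEVICES=$pair =="
    CUDA_VISIBLE_DEVICES=$pair ./simpleP2P
done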

My suspicion is that your two K80s sit on different PCI-E domains attached to different CPU sockets, but your BIOS is misconfigured and incorrectly reports that the two boards can talk to each other via P2P. If so, the 0,1 and 2,3 simpleP2P runs above should pass, while the 0,2 run will likely report that P2P is possible but then show very low bandwidth and ultimately fail the test, because the array received by GPU 2 will be garbage.

If my suspicions are correct, you would see the following behavior with AMBER:

4 x 1-GPU runs, one on each GPU, would be fine.
(1 or 2) x 2-GPU runs will be fine if you use the pairs 0,1 and 2,3, but will fail if you use 0,2, 0,3, 1,2 or 1,3 (see the example below).
1 x 4-GPU runs will fail unless you restrict them to GPUs 0,1 or 2,3 and thus oversubscribe those GPUs.
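
In other words, until the BIOS/ACS issue is sorted out, any multi-GPU run would need to be pinned to a pair of GPUs that genuinely share a PCI-E root. As a sketch (the input file names here are just placeholders for your own mdin/prmtop/inpcrd), a 2-GPU run pinned to the first pair would look like:

export CUDA_VISIBLE_DEVICES=0,1
mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin -p prmtop -c inpcrd -o mdout -r restrt -x mdcrd

and a second, independent 2-GPU job could run at the same time with CUDA_VISIBLE_DEVICES=2,3.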

P.S. nvidia-smi reporting 2 threads per MPI task is not an issue - it is to be expected.

All the best
Ross

> On Aug 11, 2016, at 7:54 PM, Steven Ford <sford123.ibbr.umd.edu> wrote:
>
> Hello,
>
> I'm still trying to figure out why the MPI CUDA tests are failing.
>
> If I run tests with DO_PARALLEL="mpirun -np 4" and limit CUDA_VISIBLE_DEVICES to only 0 or 1, all tests pass. I get the same behavior with OpenMPI 1.8, 1.10, and 2.0, and with MPICH 3.1.
>
> I ran gpuP2PCheck just in case communication between the GPUs was the problem. It confirms that communication is working:
>
> CUDA-capable device count: 2
> GPU0 " Tesla K80"
> GPU1 " Tesla K80"
>
> Two way peer access between:
> GPU0 and GPU1: YES
>
> If it's of any use, here is the output of nvidia-smi -q:
>
> ==============NVSMI LOG==============
>
> Timestamp : Thu Aug 11 22:42:34 2016
> Driver Version : 352.93
>
> Attached GPUs : 2
> GPU 0000:05:00.0
> Product Name : Tesla K80
> Product Brand : Tesla
> Display Mode : Disabled
> Display Active : Disabled
> Persistence Mode : Disabled
> Accounting Mode : Disabled
> Accounting Mode Buffer Size : 1920
> Driver Model
> Current : N/A
> Pending : N/A
> Serial Number : 0325015055313
> GPU UUID : GPU-a65eaa77-8871-ded5-b6ee-5268404192f1
> Minor Number : 0
> VBIOS Version : 80.21.1B.00.01
> MultiGPU Board : Yes
> Board ID : 0x300
> Inforom Version
> Image Version : 2080.0200.00.04
> OEM Object : 1.1
> ECC Object : 3.0
> Power Management Object : N/A
> GPU Operation Mode
> Current : N/A
> Pending : N/A
> PCI
> Bus : 0x05
> Device : 0x00
> Domain : 0x0000
> Device Id : 0x102D10DE
> Bus Id : 0000:05:00.0
> Sub System Id : 0x106C10DE
> GPU Link Info
> PCIe Generation
> Max : 3
> Current : 3
> Link Width
> Max : 16x
> Current : 16x
> Bridge Chip
> Type : PLX
> Firmware : 0xF0472900
> Replays since reset : 0
> Tx Throughput : N/A
> Rx Throughput : N/A
> Fan Speed : N/A
> Performance State : P0
> Clocks Throttle Reasons
> Idle : Not Active
> Applications Clocks Setting : Active
> SW Power Cap : Not Active
> HW Slowdown : Not Active
> Unknown : Not Active
> FB Memory Usage
> Total : 12287 MiB
> Used : 56 MiB
> Free : 12231 MiB
> BAR1 Memory Usage
> Total : 16384 MiB
> Used : 2 MiB
> Free : 16382 MiB
> Compute Mode : Default
> Utilization
> Gpu : 0 %
> Memory : 0 %
> Encoder : 0 %
> Decoder : 0 %
> Ecc Mode
> Current : Disabled
> Pending : Disabled
> ECC Errors
> Volatile
> Single Bit
> Device Memory : N/A
> Register File : N/A
> L1 Cache : N/A
> L2 Cache : N/A
> Texture Memory : N/A
> Total : N/A
> Double Bit
> Device Memory : N/A
> Register File : N/A
> L1 Cache : N/A
> L2 Cache : N/A
> Texture Memory : N/A
> Total : N/A
> Aggregate
> Single Bit
> Device Memory : N/A
> Register File : N/A
> L1 Cache : N/A
> L2 Cache : N/A
> Texture Memory : N/A
> Total : N/A
> Double Bit
> Device Memory : N/A
> Register File : N/A
> L1 Cache : N/A
> L2 Cache : N/A
> Texture Memory : N/A
> Total : N/A
> Retired Pages
> Single Bit ECC : 0
> Double Bit ECC : 0
> Pending : No
> Temperature
> GPU Current Temp : 31 C
> GPU Shutdown Temp : 93 C
> GPU Slowdown Temp : 88 C
> Power Readings
> Power Management : Supported
> Power Draw : 59.20 W
> Power Limit : 149.00 W
> Default Power Limit : 149.00 W
> Enforced Power Limit : 149.00 W
> Min Power Limit : 100.00 W
> Max Power Limit : 175.00 W
> Clocks
> Graphics : 562 MHz
> SM : 562 MHz
> Memory : 2505 MHz
> Applications Clocks
> Graphics : 562 MHz
> Memory : 2505 MHz
> Default Applications Clocks
> Graphics : 562 MHz
> Memory : 2505 MHz
> Max Clocks
> Graphics : 875 MHz
> SM : 875 MHz
> Memory : 2505 MHz
> Clock Policy
> Auto Boost : On
> Auto Boost Default : On
> Processes : None
>
> GPU 0000:06:00.0
> Product Name : Tesla K80
> Product Brand : Tesla
> Display Mode : Disabled
> Display Active : Disabled
> Persistence Mode : Disabled
> Accounting Mode : Disabled
> Accounting Mode Buffer Size : 1920
> Driver Model
> Current : N/A
> Pending : N/A
> Serial Number : 0325015055313
> GPU UUID : GPU-21c2be1c-72a9-1b68-adab-459d05dd7adc
> Minor Number : 1
> VBIOS Version : 80.21.1B.00.02
> MultiGPU Board : Yes
> Board ID : 0x300
> Inforom Version
> Image Version : 2080.0200.00.04
> OEM Object : 1.1
> ECC Object : 3.0
> Power Management Object : N/A
> GPU Operation Mode
> Current : N/A
> Pending : N/A
> PCI
> Bus : 0x06
> Device : 0x00
> Domain : 0x0000
> Device Id : 0x102D10DE
> Bus Id : 0000:06:00.0
> Sub System Id : 0x106C10DE
> GPU Link Info
> PCIe Generation
> Max : 3
> Current : 3
> Link Width
> Max : 16x
> Current : 16x
> Bridge Chip
> Type : PLX
> Firmware : 0xF0472900
> Replays since reset : 0
> Tx Throughput : N/A
> Rx Throughput : N/A
> Fan Speed : N/A
> Performance State : P0
> Clocks Throttle Reasons
> Idle : Not Active
> Applications Clocks Setting : Active
> SW Power Cap : Not Active
> HW Slowdown : Not Active
> Unknown : Not Active
> FB Memory Usage
> Total : 12287 MiB
> Used : 56 MiB
> Free : 12231 MiB
> BAR1 Memory Usage
> Total : 16384 MiB
> Used : 2 MiB
> Free : 16382 MiB
> Compute Mode : Default
> Utilization
> Gpu : 0 %
> Memory : 0 %
> Encoder : 0 %
> Decoder : 0 %
> Ecc Mode
> Current : Disabled
> Pending : Disabled
> ECC Errors
> Volatile
> Single Bit
> Device Memory : N/A
> Register File : N/A
> L1 Cache : N/A
> L2 Cache : N/A
> Texture Memory : N/A
> Total : N/A
> Double Bit
> Device Memory : N/A
> Register File : N/A
> L1 Cache : N/A
> L2 Cache : N/A
> Texture Memory : N/A
> Total : N/A
> Aggregate
> Single Bit
> Device Memory : N/A
> Register File : N/A
> L1 Cache : N/A
> L2 Cache : N/A
> Texture Memory : N/A
> Total : N/A
> Double Bit
> Device Memory : N/A
> Register File : N/A
> L1 Cache : N/A
> L2 Cache : N/A
> Texture Memory : N/A
> Total : N/A
> Retired Pages
> Single Bit ECC : 0
> Double Bit ECC : 0
> Pending : No
> Temperature
> GPU Current Temp : 24 C
> GPU Shutdown Temp : 93 C
> GPU Slowdown Temp : 88 C
> Power Readings
> Power Management : Supported
> Power Draw : 70.89 W
> Power Limit : 149.00 W
> Default Power Limit : 149.00 W
> Enforced Power Limit : 149.00 W
> Min Power Limit : 100.00 W
> Max Power Limit : 175.00 W
> Clocks
> Graphics : 562 MHz
> SM : 562 MHz
> Memory : 2505 MHz
> Applications Clocks
> Graphics : 562 MHz
> Memory : 2505 MHz
> Default Applications Clocks
> Graphics : 562 MHz
> Memory : 2505 MHz
> Max Clocks
> Graphics : 875 MHz
> SM : 875 MHz
> Memory : 2505 MHz
> Clock Policy
> Auto Boost : On
> Auto Boost Default : On
> Processes : None
>
>
> If it matters, when I do the tests with DO_PARALLEL="mpirun -np 4", I see that each process is running a thread on both GPUs. For example:
>
> # gpu pid type sm mem enc dec command
> # Idx # C/G % % % % name
> 0 30599 C 24 0 0 0 pmemd.cuda_DPFP
> 0 30600 C 0 0 0 0 pmemd.cuda_DPFP
> 0 30601 C 11 0 0 0 pmemd.cuda_DPFP
> 0 30602 C 0 0 0 0 pmemd.cuda_DPFP
> 1 30599 C 0 0 0 0 pmemd.cuda_DPFP
> 1 30600 C 36 0 0 0 pmemd.cuda_DPFP
> 1 30601 C 0 0 0 0 pmemd.cuda_DPFP
> 1 30602 C 6 0 0 0 pmemd.cuda_DPFP
>
> Is that expected behavior?
>
> Has anybody else had any problems using K80s with MPI and CUDA? Or using CentOS/RHEL 6?
>
> This machine does have dual CPUs; could that be a factor?
>
> I'm currently using AmberTools version 16.12 and Amber version 16.05.
>
> Any insight would be greatly appreciated.
>
> Thanks,
>
> Steve
>
>
>
> On Mon, Jul 25, 2016 at 3:06 PM, Steven Ford <sford123.ibbr.umd.edu> wrote:
> Ross,
>
> This is CentOS version 6.7 with kernel version 2.6.32-573.22.1.el6.x86_64.
>
> The output of nvidia-smi is:
>
> +------------------------------------------------------+
> | NVIDIA-SMI 352.79 Driver Version: 352.79 |
> |-------------------------------+----------------------+----------------------+
> | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
> | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
> |===============================+======================+======================|
> | 0 Tesla K80 Off | 0000:05:00.0 Off | Off |
> | N/A 34C P0 59W / 149W | 56MiB / 12287MiB | 0% Default |
> +-------------------------------+----------------------+----------------------+
> | 1 Tesla K80 Off | 0000:06:00.0 Off | Off |
> | N/A 27C P0 48W / 149W | 56MiB / 12287MiB | 0% Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes: GPU Memory |
> | GPU PID Type Process name Usage |
> |=============================================================================|
> | No running processes found |
> +-----------------------------------------------------------------------------+
>
> The version of nvcc:
>
> nvcc: NVIDIA (R) Cuda compiler driver
> Copyright (c) 2005-2015 NVIDIA Corporation
> Built on Tue_Aug_11_14:27:32_CDT_2015
> Cuda compilation tools, release 7.5, V7.5.17
>
> I used the GNU compilers, version 4.4.7.
>
> I am using OpenMPI version 1.8.1-5.el6 from the CentOS repository. I have not tried any other MPI installation.
>
> Output of mpif90 --showme:
>
> gfortran -I/usr/include/openmpi-x86_64 -pthread -I/usr/lib64/openmpi/lib -Wl,-rpath -Wl,/usr/lib64/openmpi/lib -Wl,--enable-new-dtags -L/usr/lib64/openmpi/lib -lmpi_usempi -lmpi_mpifh -lmpi
>
>
> I set DO_PARALLEL to "mpirun -np 2"
>
> The parallel tests for the CPU were all successful.
>
> I had not run 'make clean' in between each step. I tried the tests again this morning after running 'make clean' and got the same result.
>
> I applied all patches this morning before testing again. I am using AmberTools 16.10 and Amber 16.04.
>
>
> Thanks,
>
> Steve
>
> On Sat, Jul 23, 2016 at 6:32 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
> Hi Steven,
>
> This is a large number of very worrying failures. Something is definitely very wrong here, and I'd like to investigate further. Can you give me some more details about your system, please? This includes:
>
> The specifics of what version of Linux you are using.
>
> The output of nvidia-smi
>
> nvcc -V (might be lower case v to get version info).
>
> Did you use the GNU compilers or the Intel compilers and in either case which version?
>
> OpenMPI - can you confirm the version again and also send me the output of mpif90 --showme (it might be --show or -show or something similar) - essentially I want to see what the underlying compilation line is.
>
> Can you confirm what you had $DO_PARALLEL set to when you ran make test for the parallel GPU build? Also, can you confirm whether the regular (CPU) parallel build passed the tests?
>
> Also did you run 'make clean' before each build step? E.g.
>
> ./configure -cuda gnu
> make -j8 install
> make test
> make clean
>
> ./configure -cuda -mpi gnu
> make -j8 install
> make test
>
> Have you tried any other MPI installations, e.g. MPICH?
>
> And finally can you please confirm which version of Amber (and AmberTools) this is and which patches have been applied?
>
> Thanks.
>
> All the best
> Ross
>
>> On Jul 21, 2016, at 14:20, Steven Ford <sford123.ibbr.umd.edu> wrote:
>>
>> Ross,
>>
>> Attached are the log and diff files. Thank you for taking a look.
>>
>> Regards,
>>
>> Steve
>>
>> On Thu, Jul 21, 2016 at 5:34 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
>> Hi Steve,
>>
>> Indeed, that is too big a difference to be just rounding error. If those tests use a Langevin or Andersen thermostat, that would explain it (different random number streams), although those tests are supposed to be skipped in parallel.
>>
>> Can you send me a copy directly of your .log and .dif files for the 2 GPU run and I'll take a closer look at it.
>>
>> All the best
>> Ross
>>
>> > On Jul 20, 2016, at 21:19, Steven Ford <sford123.ibbr.umd.edu> wrote:
>> >
>> > Hello All,
>> >
>> > I'm currently trying to get Amber16 installed and running on our computing
>> > cluster. Our researchers are primarily interested in running the GPU
>> > accelerated programs. For GPU computing jobs, we have one CentOS 6.7 node
>> > with a Tesla K80.
>> >
>> > I was able to build Amber16 and run the Serial/Parallel CPU plus the Serial
>> > GPU tests with all file comparisons passing. However, only 5 parallel GPU
>> > tests succeeded, while the other 100 comparisons failed.
>> >
>> > Examining the diff file shows that some of the numbers are off only slightly, as
>> > the documentation said could happen. For example:
>> >
>> > 66c66
>> > < NSTEP = 1 TIME(PS) = 50.002 TEMP(K) = 351.27 PRESS = 0.
>> > > NSTEP = 1 TIME(PS) = 50.002 TEMP(K) = 353.29 PRESS = 0.
>> >
>> > This may also be too large to attribute to a rounding error, but it is a
>> > small difference compared to others:
>> >
>> > 85c85
>> > < Etot = -217.1552 EKtot = 238.6655 EPtot = -455.8207
>> > > Etot = -1014.2562 EKtot = 244.6242 EPtot = -1258.8804
>> >
>> > This was built with CUDA 7.5, OpenMPI 1.8, and run with DO_PARALLEL="mpirun
>> > -np 2"
>> >
>> > Any idea what else could be affecting the output?
>> >
>> > Thanks,
>> >
>> > Steve
>> >
>> > --
>> > Steven Ford
>> > IT Infrastructure Specialist
>> > Institute for Bioscience and Biotechnology Research
>> > University of Maryland
>> > (240)314-6405
>> > _______________________________________________
>> > AMBER mailing list
>> > AMBER.ambermd.org
>> > http://lists.ambermd.org/mailman/listinfo/amber
>>
>>
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber
>>
>>
>>
>> --
>> Steven Ford
>> IT Infrastructure Specialist
>> Institute for Bioscience and Biotechnology Research
>> University of Maryland
>> (240)314-6405
>> <2016-07-20_11-17-52.diff><2016-07-20_11-17-52.log>
>
>
>
>
> --
> Steven Ford
> IT Infrastructure Specialist
> Institute for Bioscience and Biotechnology Research
> University of Maryland
> (240)314-6405
>
>
>
> --
> Steven Ford
> IT Infrastructure Specialist
> Institute for Bioscience and Biotechnology Research
> University of Maryland
> (240)314-6405

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Aug 11 2016 - 20:30:02 PDT