The motherboard is SuperMicro.
On Fri, Aug 12, 2016 at 12:33 AM, Steven Ford <sford123.ibbr.umd.edu> wrote:
> Ross,
>
> Thanks, I will look for BIOS updates. Firmware aside, is there any
> configuration setting in the BIOS that would affect this?
>
> Thanks,
>
> Steve
>
> On Aug 12, 2016 12:29 AM, "Ross Walker" <ross.rosswalker.co.uk> wrote:
>
>> Hi Steven,
>>
>> Ah, I thought you meant you had 4 GPUs, as in 2 K80s, rather than a single
>> K80 card that contains 2 GPUs.
>>
>> Either way, this shows your hardware is incorrectly configured or has a
>> buggy BIOS. Who makes it? You probably need to go back to them and get an
>> updated BIOS that properly handles peer-to-peer communication.
>>
>> You could also check with the motherboard manufacturer and see if they have
>> an up-to-date BIOS that fixes this bug.
>>
>> All of those entries reported by lspci should have a minus after them if
>> things are configured correctly in the BIOS.
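>>
>> For reference, here is a minimal sketch of how to dump (and, purely as a
>> temporary test, clear) the ACS control bits on those PLX bridges from the
>> shell. The ECAP_ACS capability name assumes a reasonably recent pciutils;
>> on older versions you would have to poke the raw capability offset instead,
>> and the proper fix is still a corrected BIOS:
>>
>> #!/bin/bash
>> # Show the ACS control register for every PLX (vendor 10b5) bridge.
>> for bdf in $(lspci -d "10b5:*" | awk '{print $1}'); do
>>     echo "Bridge $bdf:"
>>     lspci -s "$bdf" -vvv | grep ACSCtl
>>     # Uncomment to clear ACS on this bridge as a test only (needs root,
>>     # does not survive a reboot):
>>     # setpci -s "$bdf" ECAP_ACS+0x6.w=0000
>> done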
>>
>> All the best
>> Ross
>>
>> On Aug 11, 2016, at 9:21 PM, Steven Ford <sford123.ibbr.umd.edu> wrote:
>>
>> Ross,
>>
>> The output of lspci -d "10b5:*" -vvv | grep ACSCtl is:
>>
>>  ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
>>  ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
>>
>>
>> With CUDA_VISIBLE_DEVICES unset:
>>
>> [./simpleP2P] - Starting...
>> Checking for multiple GPUs...
>> CUDA-capable device count: 2
>> > GPU0 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
>> > GPU1 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
>>
>> Checking GPU(s) for support of peer to peer memory access...
>> > Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : Yes
>> > Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : Yes
>> Enabling peer access between GPU0 and GPU1...
>> Checking GPU0 and GPU1 for UVA capabilities...
>> > Tesla K80 (GPU0) supports UVA: Yes
>> > Tesla K80 (GPU1) supports UVA: Yes
>> Both GPUs can support UVA, enabling...
>> Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
>> Creating event handles...
>> cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 1.11GB/s
>> Preparing host buffer and memcpy to GPU0...
>> Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
>> Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
>> Copy data back to host from GPU0 and verify results...
>> Verification error @ element 0: val = nan, ref = 0.000000
>> Verification error @ element 1: val = nan, ref = 4.000000
>> Verification error @ element 2: val = nan, ref = 8.000000
>> Verification error @ element 3: val = nan, ref = 12.000000
>> Verification error @ element 4: val = nan, ref = 16.000000
>> Verification error @ element 5: val = nan, ref = 20.000000
>> Verification error @ element 6: val = nan, ref = 24.000000
>> Verification error @ element 7: val = nan, ref = 28.000000
>> Verification error @ element 8: val = nan, ref = 32.000000
>> Verification error @ element 9: val = nan, ref = 36.000000
>> Verification error @ element 10: val = nan, ref = 40.000000
>> Verification error @ element 11: val = nan, ref = 44.000000
>> Disabling peer access...
>> Shutting down...
>> Test failed!
>>
>> With CUDA_VISIBLE_DEVICES=0,1:
>>
>> [./simpleP2P] - Starting...
>> Checking for multiple GPUs...
>> CUDA-capable device count: 2
>> > GPU0 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
>> > GPU1 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
>>
>> Checking GPU(s) for support of peer to peer memory access...
>> > Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : Yes
>> > Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : Yes
>> Enabling peer access between GPU0 and GPU1...
>> Checking GPU0 and GPU1 for UVA capabilities...
>> > Tesla K80 (GPU0) supports UVA: Yes
>> > Tesla K80 (GPU1) supports UVA: Yes
>> Both GPUs can support UVA, enabling...
>> Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
>> Creating event handles...
>> cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 1.11GB/s
>> Preparing host buffer and memcpy to GPU0...
>> Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
>> Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
>> Copy data back to host from GPU0 and verify results...
>> Verification error @ element 0: val = nan, ref = 0.000000
>> Verification error @ element 1: val = nan, ref = 4.000000
>> Verification error @ element 2: val = nan, ref = 8.000000
>> Verification error @ element 3: val = nan, ref = 12.000000
>> Verification error @ element 4: val = nan, ref = 16.000000
>> Verification error @ element 5: val = nan, ref = 20.000000
>> Verification error @ element 6: val = nan, ref = 24.000000
>> Verification error @ element 7: val = nan, ref = 28.000000
>> Verification error @ element 8: val = nan, ref = 32.000000
>> Verification error @ element 9: val = nan, ref = 36.000000
>> Verification error @ element 10: val = nan, ref = 40.000000
>> Verification error @ element 11: val = nan, ref = 44.000000
>> Disabling peer access...
>> Shutting down...
>> Test failed!
>>
>>
>> With CUDA_VISIBLE_DEVICES=2,3:
>>
>> [./simpleP2P] - Starting...
>> Checking for multiple GPUs...
>> CUDA error at simpleP2P.cu:63 code=38(cudaErrorNoDevice)
>> "cudaGetDeviceCount(&gpu_n)"
>>
>>
>> And with CUDA_VISIBLE_DEVICES=0,2:
>>
>> CUDA-capable device count: 1
>> Two or more GPUs with SM 2.0 or higher capability are required for
>> ./simpleP2P.
>> Waiving test.
>>
>>
>> I'm guessing the last two tests fail because I have only one card with two
>> K80 GPUs on it, so there are no devices 2 or 3. It seems like something is
>> awry with the peer-to-peer communication between 0 and 1. Is it possible for
>> them to be on different PCIe domains even though they are on the same
>> physical card?
>>
>> This makes me wonder: If each PCIe slot is connected to one CPU, should
>> this system either use only one CPU or have another K80 in the other PCIe
>> slot that's connected to the other CPU?
>>
>> If it helps, nvidia-smi topo -m shows:
>>
>>  GPU0    GPU1    CPU Affinity
>> GPU0     X      PIX     0-7,16-23
>> GPU1    PIX      X      0-7,16-23
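>>
>> (In case it is useful, a quick sketch with standard pciutils commands to see
>> where the bridges and GPUs sit: lspci -tv prints the PCIe tree, so both GPUs
>> should appear as leaves under the same PLX switch, and the leading "0000:"
>> in the Bus Ids above is the PCIe domain.)
>>
>> # Print the PCIe topology tree, then list just the PLX bridges and GPUs.
>> lspci -tv
>> lspci | egrep -i "plx|nvidia"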
>>
>>
>> Thanks again,
>>
>> Steve
>>
>> On Thu, Aug 11, 2016 at 11:17 PM, Ross Walker <ross.rosswalker.co.uk>
>> wrote:
>>
>>> Hi Steve,
>>>
>>> I suspect your hardware is misconfigured. Can you run a couple of tests
>>> for me, please?
>>>
>>> With CUDA_VISIBLE_DEVICES unset
>>>
>>> 1) As root run: lspci -d "10b5:*" -vvv | grep ACSCtl
>>>
>>> and post the output here.
>>>
>>> 2) Compile the CUDA samples installed as part of CUDA 7.5 and then run
>>> the following:
>>>
>>> unset CUDA_VISIBLE_DEVICES
>>> ./simpleP2P
>>>
>>> export CUDA_VISIBLE_DEVICES=0,1
>>> ./simpleP2P
>>>
>>> export CUDA_VISIBLE_DEVICES=2,3
>>> ./simpleP2P
>>>
>>> export CUDA_VISIBLE_DEVICES=0,2
>>> ./simpleP2P
>>>
>>> And post the results here.
>>>
>>> My suspicion is that your two K80s are on different PCIe domains
>>> connected to different CPU sockets, BUT your BIOS is misconfigured such
>>> that it incorrectly reports that the two K80s can talk to each other via
>>> P2P. Thus the first two simpleP2P runs above should pass. The last one will
>>> likely report that P2P is possible, but the bandwidth will be very low and
>>> it will ultimately fail the test because the array received by GPU 2 will
>>> be garbage.
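>>>
>>> (As an additional check, and assuming the sample ships in your CUDA 7.5
>>> samples tree, the p2pBandwidthLatencyTest sample prints a bandwidth matrix
>>> for every GPU pair, which makes a falsely advertised P2P path stand out as
>>> an abnormally low number. SAMPLES_DIR below is just a placeholder for
>>> wherever you copied the samples.)
>>>
>>> cd $SAMPLES_DIR/1_Utilities/p2pBandwidthLatencyTest
>>> make
>>> unset CUDA_VISIBLE_DEVICES
>>> ./p2pBandwidthLatencyTest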
>>>
>>> If my suspicions are correct you would find the following behavior with
>>> AMBER
>>>
>>> 4 x 1 GPU runs, one on each GPU, would be fine (see the sketch below for
>>> pinning each run to a GPU).
>>> (1 or 2) x 2 GPU runs will be fine if you use GPUs 0,1 and 2,3, but will
>>> fail if you were to use 0,2 - 0,3 - 1,2 or 1,3.
>>> 1 x 4 GPU runs will fail unless you restrict them to GPUs 0,1 or 2,3 and
>>> thus overload those GPUs.
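>>>
>>> (Here is a sketch of the "4 x 1 GPU" case above, using CUDA_VISIBLE_DEVICES
>>> to pin each independent run to one GPU; the input and output file names are
>>> just placeholders.)
>>>
>>> #!/bin/bash
>>> # Launch one single-GPU pmemd.cuda job per GPU and wait for all of them.
>>> for gpu in 0 1 2 3; do
>>>     CUDA_VISIBLE_DEVICES=$gpu pmemd.cuda -O -i md.in -p prmtop -c inpcrd \
>>>         -o md_gpu${gpu}.out -r md_gpu${gpu}.rst &
>>> done
>>> wait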
>>>
>>> P.S. nvidia-smi reporting 2 threads per MPI task is not an issue; it is to
>>> be expected.
>>>
>>> All the best
>>> Ross
>>>
>>> On Aug 11, 2016, at 7:54 PM, Steven Ford <sford123.ibbr.umd.edu> wrote:
>>>
>>> Hello,
>>>
>>> I'm still trying to figure out why the MPI CUDA tests are failing.
>>>
>>> If I run tests with DO_PARALLEL="mpirun -np 4" and limit
>>> CUDA_VISIBLE_DEVICES to only 0 or 1, all tests pass. I get the same
>>> behavior with OpenMPI 1.8, 1.10, and 2.0, and with MPICH 3.1.
>>>
>>> I ran gpuP2PCheck just in case communication between the GPUs was the
>>> problem. It confirms that communication is working:
>>>
>>> CUDA-capable device count: 2
>>>    GPU0 "      Tesla K80"
>>>    GPU1 "      Tesla K80"
>>>
>>> Two way peer access between:
>>>    GPU0 and GPU1: YES
>>>
>>> If it's of any use, here is the output of nvidia-smi -q:
>>>
>>> ==============NVSMI LOG==============
>>>
>>> Timestamp                           : Thu Aug 11 22:42:34 2016
>>> Driver Version                      : 352.93
>>>
>>> Attached GPUs                       : 2
>>> GPU 0000:05:00.0
>>>     Product Name                    : Tesla K80
>>>     Product Brand                   : Tesla
>>>     Display Mode                    : Disabled
>>>     Display Active                  : Disabled
>>>     Persistence Mode                : Disabled
>>>     Accounting Mode                 : Disabled
>>>     Accounting Mode Buffer Size     : 1920
>>>     Driver Model
>>>         Current                     : N/A
>>>         Pending                     : N/A
>>>     Serial Number                   : 0325015055313
>>>     GPU UUID                        : GPU-a65eaa77-8871-ded5-b6ee-5268404192f1
>>>     Minor Number                    : 0
>>>     VBIOS Version                   : 80.21.1B.00.01
>>>     MultiGPU Board                  : Yes
>>>     Board ID                        : 0x300
>>>     Inforom Version
>>>         Image Version               : 2080.0200.00.04
>>>         OEM Object                  : 1.1
>>>         ECC Object                  : 3.0
>>>         Power Management Object     : N/A
>>>     GPU Operation Mode
>>>         Current                     : N/A
>>>         Pending                     : N/A
>>>     PCI
>>>         Bus                         : 0x05
>>>         Device                      : 0x00
>>>         Domain                      : 0x0000
>>>         Device Id                   : 0x102D10DE
>>>         Bus Id                      : 0000:05:00.0
>>>         Sub System Id               : 0x106C10DE
>>>         GPU Link Info
>>>             PCIe Generation
>>>                 Max                 : 3
>>>                 Current             : 3
>>>             Link Width
>>>                 Max                 : 16x
>>>                 Current             : 16x
>>>         Bridge Chip
>>>             Type                    : PLX
>>>             Firmware                : 0xF0472900
>>>         Replays since reset         : 0
>>>         Tx Throughput               : N/A
>>>         Rx Throughput               : N/A
>>>     Fan Speed                       : N/A
>>>     Performance State               : P0
>>>     Clocks Throttle Reasons
>>>         Idle                        : Not Active
>>>         Applications Clocks Setting : Active
>>>         SW Power Cap                : Not Active
>>>         HW Slowdown                 : Not Active
>>>         Unknown                     : Not Active
>>>     FB Memory Usage
>>>         Total                       : 12287 MiB
>>>         Used                        : 56 MiB
>>>         Free                        : 12231 MiB
>>>     BAR1 Memory Usage
>>>         Total                       : 16384 MiB
>>>         Used                        : 2 MiB
>>>         Free                        : 16382 MiB
>>>     Compute Mode                    : Default
>>>     Utilization
>>>         Gpu                         : 0 %
>>>         Memory                      : 0 %
>>>         Encoder                     : 0 %
>>>         Decoder                     : 0 %
>>>     Ecc Mode
>>>         Current                     : Disabled
>>>         Pending                     : Disabled
>>>     ECC Errors
>>>         Volatile
>>>             Single Bit
>>>                 Device Memory       : N/A
>>>                 Register File       : N/A
>>>                 L1 Cache            : N/A
>>>                 L2 Cache            : N/A
>>>                 Texture Memory      : N/A
>>>                 Total               : N/A
>>>             Double Bit
>>>                 Device Memory       : N/A
>>>                 Register File       : N/A
>>>                 L1 Cache            : N/A
>>>                 L2 Cache            : N/A
>>>                 Texture Memory      : N/A
>>>                 Total               : N/A
>>>         Aggregate
>>>             Single Bit
>>>                 Device Memory       : N/A
>>>                 Register File       : N/A
>>>                 L1 Cache            : N/A
>>>                 L2 Cache            : N/A
>>>                 Texture Memory      : N/A
>>>                 Total               : N/A
>>>             Double Bit
>>>                 Device Memory       : N/A
>>>                 Register File       : N/A
>>>                 L1 Cache            : N/A
>>>                 L2 Cache            : N/A
>>>                 Texture Memory      : N/A
>>>                 Total               : N/A
>>>     Retired Pages
>>>         Single Bit ECC              : 0
>>>         Double Bit ECC              : 0
>>>         Pending                     : No
>>>     Temperature
>>>         GPU Current Temp            : 31 C
>>>         GPU Shutdown Temp           : 93 C
>>>         GPU Slowdown Temp           : 88 C
>>>     Power Readings
>>>         Power Management            : Supported
>>>         Power Draw                  : 59.20 W
>>>         Power Limit                 : 149.00 W
>>>         Default Power Limit         : 149.00 W
>>>         Enforced Power Limit        : 149.00 W
>>>         Min Power Limit             : 100.00 W
>>>         Max Power Limit             : 175.00 W
>>>     Clocks
>>>         Graphics                    : 562 MHz
>>>         SM                          : 562 MHz
>>>         Memory                      : 2505 MHz
>>>     Applications Clocks
>>>         Graphics                    : 562 MHz
>>>         Memory                      : 2505 MHz
>>>     Default Applications Clocks
>>>         Graphics                    : 562 MHz
>>>         Memory                      : 2505 MHz
>>>     Max Clocks
>>>         Graphics                    : 875 MHz
>>>         SM                          : 875 MHz
>>>         Memory                      : 2505 MHz
>>>     Clock Policy
>>>         Auto Boost                  : On
>>>         Auto Boost Default          : On
>>>     Processes                       : None
>>>
>>> GPU 0000:06:00.0
>>>     Product Name                    : Tesla K80
>>>     Product Brand                   : Tesla
>>>     Display Mode                    : Disabled
>>>     Display Active                  : Disabled
>>>     Persistence Mode                : Disabled
>>>     Accounting Mode                 : Disabled
>>>     Accounting Mode Buffer Size     : 1920
>>>     Driver Model
>>>         Current                     : N/A
>>>         Pending                     : N/A
>>>     Serial Number                   : 0325015055313
>>>     GPU UUID                        : GPU-21c2be1c-72a9-1b68-adab-459d05dd7adc
>>>     Minor Number                    : 1
>>>     VBIOS Version                   : 80.21.1B.00.02
>>>     MultiGPU Board                  : Yes
>>>     Board ID                        : 0x300
>>>     Inforom Version
>>>         Image Version               : 2080.0200.00.04
>>>         OEM Object                  : 1.1
>>>         ECC Object                  : 3.0
>>>         Power Management Object     : N/A
>>>     GPU Operation Mode
>>>         Current                     : N/A
>>>         Pending                     : N/A
>>>     PCI
>>>         Bus                         : 0x06
>>>         Device                      : 0x00
>>>         Domain                      : 0x0000
>>>         Device Id                   : 0x102D10DE
>>>         Bus Id                      : 0000:06:00.0
>>>         Sub System Id               : 0x106C10DE
>>>         GPU Link Info
>>>             PCIe Generation
>>>                 Max                 : 3
>>>                 Current             : 3
>>>             Link Width
>>>                 Max                 : 16x
>>>                 Current             : 16x
>>>         Bridge Chip
>>>             Type                    : PLX
>>>             Firmware                : 0xF0472900
>>>         Replays since reset         : 0
>>>         Tx Throughput               : N/A
>>>         Rx Throughput               : N/A
>>>     Fan Speed                       : N/A
>>>     Performance State               : P0
>>>     Clocks Throttle Reasons
>>>         Idle                        : Not Active
>>>         Applications Clocks Setting : Active
>>>         SW Power Cap                : Not Active
>>>         HW Slowdown                 : Not Active
>>>         Unknown                     : Not Active
>>>     FB Memory Usage
>>>         Total                       : 12287 MiB
>>>         Used                        : 56 MiB
>>>         Free                        : 12231 MiB
>>>     BAR1 Memory Usage
>>>         Total                       : 16384 MiB
>>>         Used                        : 2 MiB
>>>         Free                        : 16382 MiB
>>>     Compute Mode                    : Default
>>>     Utilization
>>>         Gpu                         : 0 %
>>>         Memory                      : 0 %
>>>         Encoder                     : 0 %
>>>         Decoder                     : 0 %
>>>     Ecc Mode
>>>         Current                     : Disabled
>>>         Pending                     : Disabled
>>>     ECC Errors
>>>         Volatile
>>>             Single Bit
>>>                 Device Memory       : N/A
>>>                 Register File       : N/A
>>>                 L1 Cache            : N/A
>>>                 L2 Cache            : N/A
>>>                 Texture Memory      : N/A
>>>                 Total               : N/A
>>>             Double Bit
>>>                 Device Memory       : N/A
>>>                 Register File       : N/A
>>>                 L1 Cache            : N/A
>>>                 L2 Cache            : N/A
>>>                 Texture Memory      : N/A
>>>                 Total               : N/A
>>>         Aggregate
>>>             Single Bit
>>>                 Device Memory       : N/A
>>>                 Register File       : N/A
>>>                 L1 Cache            : N/A
>>>                 L2 Cache            : N/A
>>>                 Texture Memory      : N/A
>>>                 Total               : N/A
>>>             Double Bit
>>>                 Device Memory       : N/A
>>>                 Register File       : N/A
>>>                 L1 Cache            : N/A
>>>                 L2 Cache            : N/A
>>>                 Texture Memory      : N/A
>>>                 Total               : N/A
>>>     Retired Pages
>>>         Single Bit ECC              : 0
>>>         Double Bit ECC              : 0
>>>         Pending                     : No
>>>     Temperature
>>>         GPU Current Temp            : 24 C
>>>         GPU Shutdown Temp           : 93 C
>>>         GPU Slowdown Temp           : 88 C
>>>     Power Readings
>>>         Power Management            : Supported
>>>         Power Draw                  : 70.89 W
>>>         Power Limit                 : 149.00 W
>>>         Default Power Limit         : 149.00 W
>>>         Enforced Power Limit        : 149.00 W
>>>         Min Power Limit             : 100.00 W
>>>         Max Power Limit             : 175.00 W
>>>     Clocks
>>>         Graphics                    : 562 MHz
>>>         SM                          : 562 MHz
>>>         Memory                      : 2505 MHz
>>>     Applications Clocks
>>>         Graphics                    : 562 MHz
>>>         Memory                      : 2505 MHz
>>>     Default Applications Clocks
>>>         Graphics                    : 562 MHz
>>>         Memory                      : 2505 MHz
>>>     Max Clocks
>>>         Graphics                    : 875 MHz
>>>         SM                          : 875 MHz
>>>         Memory                      : 2505 MHz
>>>     Clock Policy
>>>         Auto Boost                  : On
>>>         Auto Boost Default          : On
>>>     Processes                       : None
>>>
>>>
>>> If it matters, when I do the tests with DO_PARALLEL="mpirun -np 4", I
>>> see that each process is running a thread on both GPUs. For example:
>>>
>>> # gpu     pid  type    sm   mem   enc   dec   command
>>> # Idx       #   C/G     %     %     %     %   name
>>>     0   30599     C    24     0     0     0   pmemd.cuda_DPFP
>>>     0   30600     C     0     0     0     0   pmemd.cuda_DPFP
>>>     0   30601     C    11     0     0     0   pmemd.cuda_DPFP
>>>     0   30602     C     0     0     0     0   pmemd.cuda_DPFP
>>>     1   30599     C     0     0     0     0   pmemd.cuda_DPFP
>>>     1   30600     C    36     0     0     0   pmemd.cuda_DPFP
>>>     1   30601     C     0     0     0     0   pmemd.cuda_DPFP
>>>     1   30602     C     6     0     0     0   pmemd.cuda_DPFP
>>>
>>> Is that expected behavior?
>>>
>>> Has anybody else had any problems using K80s with MPI and CUDA? Or using
>>> CentOS/RHEL 6?
>>>
>>> This machine does have dual CPUs; could that be a factor?
>>>
>>> I'm currently using AmberTools version 16.12 and Amber version 16.05.
>>>
>>> Any insight would be greatly appreciated.
>>>
>>> Thanks,
>>>
>>> Steve
>>>
>>>
>>>
>>> On Mon, Jul 25, 2016 at 3:06 PM, Steven Ford <sford123.ibbr.umd.edu>
>>> wrote:
>>>
>>>> Ross,
>>>>
>>>> This is CentOS version 6.7 with kernel version
>>>> 2.6.32-573.22.1.el6.x86_64.
>>>>
>>>> The output of nvidia-smi is:
>>>>
>>>> +------------------------------------------------------+
>>>> | NVIDIA-SMI 352.79     Driver Version: 352.79         |
>>>> |-------------------------------+----------------------+----------------------+
>>>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>>>> |===============================+======================+======================|
>>>> |   0  Tesla K80           Off  | 0000:05:00.0     Off |                  Off |
>>>> | N/A   34C    P0    59W / 149W |     56MiB / 12287MiB |      0%      Default |
>>>> +-------------------------------+----------------------+----------------------+
>>>> |   1  Tesla K80           Off  | 0000:06:00.0     Off |                  Off |
>>>> | N/A   27C    P0    48W / 149W |     56MiB / 12287MiB |      0%      Default |
>>>> +-------------------------------+----------------------+----------------------+
>>>>
>>>> +-----------------------------------------------------------------------------+
>>>> | Processes:                                                       GPU Memory |
>>>> |  GPU       PID  Type  Process name                               Usage      |
>>>> |=============================================================================|
>>>> |  No running processes found                                                 |
>>>> +-----------------------------------------------------------------------------+
>>>>
>>>> The version of nvcc:
>>>>
>>>> nvcc: NVIDIA (R) Cuda compiler driver
>>>> Copyright (c) 2005-2015 NVIDIA Corporation
>>>> Built on Tue_Aug_11_14:27:32_CDT_2015
>>>> Cuda compilation tools, release 7.5, V7.5.17
>>>>
>>>> I used the GNU compilers, version 4.4.7.
>>>>
>>>> I am using OpenMPI version 1.8.1-5.el6 from the CentOS repository. I
>>>> have not tried any other MPI installation.
>>>>
>>>> Output of mpif90 --showme:
>>>>
>>>> gfortran -I/usr/include/openmpi-x86_64 -pthread
>>>> -I/usr/lib64/openmpi/lib -Wl,-rpath -Wl,/usr/lib64/openmpi/lib
>>>> -Wl,--enable-new-dtags -L/usr/lib64/openmpi/lib -lmpi_usempi -lmpi_mpifh
>>>> -lmpi
>>>>
>>>>
>>>> I set DO_PARALLEL to "mpirun -np 2"
>>>>
>>>> The parallel tests for the CPU were all successful.
>>>>
>>>> I had not run 'make clean' in between each step. I tried the tests
>>>> again this morning after running 'make clean' and got the same result.
>>>>
>>>> I applied all patches this morning before testing again. I am using
>>>> AmberTools 16.10 and Amber 16.04
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Steve
>>>>
>>>> On Sat, Jul 23, 2016 at 6:32 PM, Ross Walker <ross.rosswalker.co.uk>
>>>> wrote:
>>>>
>>>>> Hi Steven,
>>>>>
>>>>> This is a large number of very worrying failures. Something is
>>>>> definitely very wrong here and I'd like to investigate further. Can you
>>>>> give me some more details about your system, please? This includes:
>>>>>
>>>>> The specifics of what version of Linux you are using.
>>>>>
>>>>> The output of nvidia-smi
>>>>>
>>>>> nvcc -V   (might be lower case v to get version info).
>>>>>
>>>>> Did you use the GNU compilers or the Intel compilers and in either
>>>>> case which version?
>>>>>
>>>>> OpenMPI - can you confirm the version again and also send me the
>>>>> output of mpif90 --showme (it might be --show or -show or something
>>>>> similar) - essentially I want to see what the underlying compilation line
>>>>> is.
>>>>>
>>>>> Can you confirm what you had $DO_PARALLEL set to when you ran make
>>>>> test for the parallel GPU build? Also, can you confirm whether the regular
>>>>> (CPU) parallel build passed the tests, please?
>>>>>
>>>>> Also did you run 'make clean' before each build step? E.g.
>>>>>
>>>>> ./configure -cuda gnu
>>>>> make -j8 install
>>>>> make test
>>>>> *make clean*
>>>>>
>>>>> ./configure -cuda -mpi gnu
>>>>> make -j8 install
>>>>> make test
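>>>>>
>>>>> (For the parallel builds, a minimal sketch of the test invocation; the
>>>>> process count is just an example and should match what you normally use.)
>>>>>
>>>>> export DO_PARALLEL="mpirun -np 2"
>>>>> make test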
>>>>>
>>>>> Have you tried any other MPI installations? - E.g. MPICH?
>>>>>
>>>>> And finally can you please confirm which version of Amber (and
>>>>> AmberTools) this is and which patches have been applied?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> All the best
>>>>> Ross
>>>>>
>>>>> On Jul 21, 2016, at 14:20, Steven Ford <sford123.ibbr.umd.edu> wrote:
>>>>>
>>>>> Ross,
>>>>>
>>>>> Attached are the log and diff files. Thank you for taking a look.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Steve
>>>>>
>>>>> On Thu, Jul 21, 2016 at 5:34 AM, Ross Walker <ross.rosswalker.co.uk>
>>>>> wrote:
>>>>>
>>>>>> Hi Steve,
>>>>>>
>>>>>> Indeed, that is too big a difference to be just rounding error. If
>>>>>> those tests are using Langevin or Andersen for the thermostat, that
>>>>>> would explain it (different random number streams), although those
>>>>>> tests are supposed to be skipped in parallel.
>>>>>>
>>>>>> Can you send me a copy directly of your .log and .dif files for the 2
>>>>>> GPU run and I'll take a closer look at it.
>>>>>>
>>>>>> All the best
>>>>>> Ross
>>>>>>
>>>>>> > On Jul 20, 2016, at 21:19, Steven Ford <sford123.ibbr.umd.edu>
>>>>>> wrote:
>>>>>> >
>>>>>> > Hello All,
>>>>>> >
>>>>>> > I am currently trying to get Amber16 installed and running on our
>>>>>> > computing cluster. Our researchers are primarily interested in running
>>>>>> > the GPU accelerated programs. For GPU computing jobs, we have one
>>>>>> > CentOS 6.7 node with a Tesla K80.
>>>>>> >
>>>>>> > I was able to build Amber16 and run the Serial/Parallel CPU plus the
>>>>>> > Serial GPU tests with all file comparisons passing. However, only 5
>>>>>> > parallel GPU tests succeeded, while the other 100 comparisons failed.
>>>>>> >
>>>>>> > Examining the diff file shows that some of the numbers are not off by
>>>>>> > much, like the documentation said could happen. For example:
>>>>>> >
>>>>>> > 66c66
>>>>>> > <  NSTEP =        1   TIME(PS) =      50.002  TEMP(K) =   351.27  PRESS =     0.
>>>>>> > >  NSTEP =        1   TIME(PS) =      50.002  TEMP(K) =   353.29  PRESS =     0.
>>>>>> >
>>>>>> > This may also be too large to attribute to a rounding error, but it
>>>>>> > is a small difference compared to others:
>>>>>> >
>>>>>> > 85c85
>>>>>> > <  Etot   =      -217.1552  EKtot   =       238.6655  EPtot      =      -455.8207
>>>>>> > >  Etot   =     -1014.2562  EKtot   =       244.6242  EPtot      =     -1258.8804
>>>>>> >
>>>>>> > This was built with CUDA 7.5 and OpenMPI 1.8, and run with
>>>>>> > DO_PARALLEL="mpirun -np 2"
>>>>>> >
>>>>>> > Any idea what else could be affecting the output?
>>>>>> >
>>>>>> > Thanks,
>>>>>> >
>>>>>> > Steve
>>>>>> >
>>>>>> > --
>>>>>> > Steven Ford
>>>>>> > IT Infrastructure Specialist
>>>>>> > Institute for Bioscience and Biotechnology Research
>>>>>> > University of Maryland
>>>>>> > (240)314-6405
>>>>>> > _______________________________________________
>>>>>> > AMBER mailing list
>>>>>> > AMBER.ambermd.org
>>>>>> > http://lists.ambermd.org/mailman/listinfo/amber
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Steven Ford
>>>>> IT Infrastructure Specialist
>>>>> Institute for Bioscience and Biotechnology Research
>>>>> University of Maryland
>>>>> (240)314-6405
>>>>> <2016-07-20_11-17-52.diff><2016-07-20_11-17-52.log>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Steven Ford
>>>> IT Infrastructure Specialist
>>>> Institute for Bioscience and Biotechnology Research
>>>> University of Maryland
>>>> (240)314-6405
>>>>
>>>
>>>
>>>
>>> --
>>> Steven Ford
>>> IT Infrastructure Specialist
>>> Institute for Bioscience and Biotechnology Research
>>> University of Maryland
>>> (240)314-6405
>>>
>>>
>>>
>>
>>
>> --
>> Steven Ford
>> IT Infrastructure Specialist
>> Institute for Bioscience and Biotechnology Research
>> University of Maryland
>> (240)314-6405
>>
>>
>>
-- 
Steven Ford
IT Infrastructure Specialist
Institute for Bioscience and Biotechnology Research
University of Maryland
(240)314-6405
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber