The motherboard is SuperMicro.
On Fri, Aug 12, 2016 at 12:33 AM, Steven Ford <sford123.ibbr.umd.edu> wrote:
> Ross,
>
> Thanks, I will look for BIOS updates. Firmware aside, is there any
> configuration setting in the BIOS that would affect this?
>
> Thanks,
>
> Steve
>
> On Aug 12, 2016 12:29 AM, "Ross Walker" <ross.rosswalker.co.uk> wrote:
>
>> Hi Steven,
>>
>> Ah, I thought you meant you had 4 GPUs, as in 2 K80s, rather than a single
>> K80 card that contains 2 GPUs.
>>
>> Either way, this shows your hardware is incorrectly configured or has a
>> buggy BIOS. Who makes it? You probably need to go back to them and get an
>> updated BIOS that properly handles peer-to-peer communication.
>>
>> You could also check with the motherboard manufacturer and see if they have
>> an up-to-date BIOS that fixes this bug.
>>
>> All of those entries reported by lspci should have a minus after them if
>> things are configured correctly in the BIOS.
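>>
>> For reference, here is a minimal sketch of how to dump (and, purely as a
>> temporary test, clear) the ACS control bits on those PLX bridges from the
>> shell. The ECAP_ACS capability name assumes a reasonably recent pciutils;
>> on older versions you would have to poke the raw capability offset instead,
>> and the proper fix is still a corrected BIOS:
>>
>> #!/bin/bash
>> # Show the ACS control register for every PLX (vendor 10b5) bridge.
>> for bdf in $(lspci -d "10b5:*" | awk '{print $1}'); do
>>     echo "Bridge $bdf:"
>>     lspci -s "$bdf" -vvv | grep ACSCtl
>>     # Uncomment to clear ACS on this bridge as a test only (needs root,
>>     # does not survive a reboot):
>>     # setpci -s "$bdf" ECAP_ACS+0x6.w=0000
>> done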
>>
>> All the best
>> Ross
>>
>> On Aug 11, 2016, at 9:21 PM, Steven Ford <sford123.ibbr.umd.edu> wrote:
>>
>> Ross,
>>
>> The output of lspci -d "10b5:*" -vvv | grep ACSCtl is:
>>
>>  ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
>>  ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
>>
>>
>> With CUDA_VISIBLE_DEVICES unset:
>>
>> [./simpleP2P] - Starting...
>> Checking for multiple GPUs...
>> CUDA-capable device count: 2
>> > GPU0 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
>> > GPU1 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
>>
>> Checking GPU(s) for support of peer to peer memory access...
>> > Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : Yes
>> > Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : Yes
>> Enabling peer access between GPU0 and GPU1...
>> Checking GPU0 and GPU1 for UVA capabilities...
>> > Tesla K80 (GPU0) supports UVA: Yes
>> > Tesla K80 (GPU1) supports UVA: Yes
>> Both GPUs can support UVA, enabling...
>> Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
>> Creating event handles...
>> cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 1.11GB/s
>> Preparing host buffer and memcpy to GPU0...
>> Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
>> Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
>> Copy data back to host from GPU0 and verify results...
>> Verification error @ element 0: val = nan, ref = 0.000000
>> Verification error @ element 1: val = nan, ref = 4.000000
>> Verification error @ element 2: val = nan, ref = 8.000000
>> Verification error @ element 3: val = nan, ref = 12.000000
>> Verification error @ element 4: val = nan, ref = 16.000000
>> Verification error @ element 5: val = nan, ref = 20.000000
>> Verification error @ element 6: val = nan, ref = 24.000000
>> Verification error @ element 7: val = nan, ref = 28.000000
>> Verification error @ element 8: val = nan, ref = 32.000000
>> Verification error @ element 9: val = nan, ref = 36.000000
>> Verification error @ element 10: val = nan, ref = 40.000000
>> Verification error @ element 11: val = nan, ref = 44.000000
>> Disabling peer access...
>> Shutting down...
>> Test failed!
>>
>> With CUDA_VISIBLE_DEVICES=0,1:
>>
>> [./simpleP2P] - Starting...
>> Checking for multiple GPUs...
>> CUDA-capable device count: 2
>> > GPU0 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
>> > GPU1 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
>>
>> Checking GPU(s) for support of peer to peer memory access...
>> > Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : Yes
>> > Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : Yes
>> Enabling peer access between GPU0 and GPU1...
>> Checking GPU0 and GPU1 for UVA capabilities...
>> > Tesla K80 (GPU0) supports UVA: Yes
>> > Tesla K80 (GPU1) supports UVA: Yes
>> Both GPUs can support UVA, enabling...
>> Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
>> Creating event handles...
>> cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 1.11GB/s
>> Preparing host buffer and memcpy to GPU0...
>> Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
>> Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
>> Copy data back to host from GPU0 and verify results...
>> Verification error @ element 0: val = nan, ref = 0.000000
>> Verification error @ element 1: val = nan, ref = 4.000000
>> Verification error @ element 2: val = nan, ref = 8.000000
>> Verification error @ element 3: val = nan, ref = 12.000000
>> Verification error @ element 4: val = nan, ref = 16.000000
>> Verification error @ element 5: val = nan, ref = 20.000000
>> Verification error @ element 6: val = nan, ref = 24.000000
>> Verification error @ element 7: val = nan, ref = 28.000000
>> Verification error @ element 8: val = nan, ref = 32.000000
>> Verification error @ element 9: val = nan, ref = 36.000000
>> Verification error @ element 10: val = nan, ref = 40.000000
>> Verification error @ element 11: val = nan, ref = 44.000000
>> Disabling peer access...
>> Shutting down...
>> Test failed!
>>
>>
>> With CUDA_VISIBLE_DEVICES=2,3:
>>
>> [./simpleP2P] - Starting...
>> Checking for multiple GPUs...
>> CUDA error at simpleP2P.cu:63 code=38(cudaErrorNoDevice)
>> "cudaGetDeviceCount(&gpu_n)"
>>
>>
>> And with CUDA_VISIBLE_DEVICES=0,2:
>>
>> CUDA-capable device count: 1
>> Two or more GPUs with SM 2.0 or higher capability are required for
>> ./simpleP2P.
>> Waiving test.
>>
>>
>> I'm guessing the last two tests fail because I have only one card with two
>> K80 GPUs on it, so there are no devices 2 or 3. It seems like something is
>> awry with the peer-to-peer communication between 0 and 1. Is it possible for
>> them to be on different PCIe domains even though they are on the same
>> physical card?
>>
>> This makes me wonder: If each PCIe slot is connected to one CPU, should
>> this system either use only one CPU or have another K80 in the other PCIe
>> slot that's connected to the other CPU?
>>
>> If it helps, nvidia-smi topo -m shows:
>>
>>  GPU0    GPU1    CPU Affinity
>> GPU0     X      PIX     0-7,16-23
>> GPU1    PIX      X      0-7,16-23
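>>
>> (In case it is useful, a quick sketch with standard pciutils commands to see
>> where the bridges and GPUs sit: lspci -tv prints the PCIe tree, so both GPUs
>> should appear as leaves under the same PLX switch, and the leading "0000:"
>> in the Bus Ids above is the PCIe domain.)
>>
>> # Print the PCIe topology tree, then list just the PLX bridges and GPUs.
>> lspci -tv
>> lspci | egrep -i "plx|nvidia"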
>>
>>
>> Thanks again,
>>
>> Steve
>>
>> On Thu, Aug 11, 2016 at 11:17 PM, Ross Walker <ross.rosswalker.co.uk>
>> wrote:
>>
>>> Hi Steve,
>>>
>>> I suspect your hardware is misconfigured. Can you run a couple of tests
>>> for me, please?
>>>
>>> With CUDA_VISIBLE_DEVICES unset
>>>
>>> 1) As root run: lspci -d "10b5:*" -vvv | grep ACSCtl
>>>
>>> and post the output here.
>>>
>>> 2) Compile the CUDA samples installed as part of CUDA 7.5 and then run
>>> the following:
>>>
>>> unset CUDA_VISIBLE_DEVICES
>>> ./simpleP2P
>>>
>>> export CUDA_VISIBLE_DEVICES=0,1
>>> ./simpleP2P
>>>
>>> export CUDA_VISIBLE_DEVICES=2,3
>>> ./simpleP2P
>>>
>>> export CUDA_VISIBLE_DEVICES=0,2
>>> ./simpleP2P
>>>
>>> And post the results here.
>>>
>>> My suspicion is that your two K80s are on different PCIe domains
>>> connected to different CPU sockets, BUT your BIOS is misconfigured such
>>> that it incorrectly reports that the two K80s can talk to each other via
>>> P2P. Thus the first two simpleP2P runs above should pass. The last one will
>>> likely report that P2P is possible, but the bandwidth will be very low and
>>> it will ultimately fail the test because the array received by GPU 2 will
>>> be garbage.
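>>>
>>> (As an additional check, and assuming the sample ships in your CUDA 7.5
>>> samples tree, the p2pBandwidthLatencyTest sample prints a bandwidth matrix
>>> for every GPU pair, which makes a falsely advertised P2P path stand out as
>>> an abnormally low number. SAMPLES_DIR below is just a placeholder for
>>> wherever you copied the samples.)
>>>
>>> cd $SAMPLES_DIR/1_Utilities/p2pBandwidthLatencyTest
>>> make
>>> unset CUDA_VISIBLE_DEVICES
>>> ./p2pBandwidthLatencyTest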
>>>
>>> If my suspicions are correct you would find the following behavior with
>>> AMBER
>>>
>>> 4 x 1 GPU runs, one on each GPU, would be fine (see the sketch below for
>>> pinning each run to a GPU).
>>> (1 or 2) x 2 GPU runs will be fine if you use GPUs 0,1 and 2,3, but will
>>> fail if you were to use 0,2 - 0,3 - 1,2 or 1,3.
>>> 1 x 4 GPU runs will fail unless you restrict them to GPUs 0,1 or 2,3 and
>>> thus overload those GPUs.
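>>>
>>> (Here is a sketch of the "4 x 1 GPU" case above, using CUDA_VISIBLE_DEVICES
>>> to pin each independent run to one GPU; the input and output file names are
>>> just placeholders.)
>>>
>>> #!/bin/bash
>>> # Launch one single-GPU pmemd.cuda job per GPU and wait for all of them.
>>> for gpu in 0 1 2 3; do
>>>     CUDA_VISIBLE_DEVICES=$gpu pmemd.cuda -O -i md.in -p prmtop -c inpcrd \
>>>         -o md_gpu${gpu}.out -r md_gpu${gpu}.rst &
>>> done
>>> wait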
>>>
>>> P.S. nvidia-smi reporting 2 threads per MPI task is not an issue; it is to
>>> be expected.
>>>
>>> All the best
>>> Ross
>>>
>>> On Aug 11, 2016, at 7:54 PM, Steven Ford <sford123.ibbr.umd.edu> wrote:
>>>
>>> Hello,
>>>
>>> I'm still trying to figure out why the MPI CUDA tests are failing.
>>>
>>> If I run tests with DO_PARALLEL="mpirun -np 4" and limit
>>> CUDA_VISIBLE_DEVICES to only 0 or 1, all tests pass. I get the same
>>> behavior with OpenMPI 1.8, 1.10, and 2.0, and with MPICH 3.1.
>>>
>>> I ran gpuP2PCheck just in case communication between the GPUs was the
>>> problem. It confirms that communication is working:
>>>
>>> CUDA-capable device count: 2
>>>    GPU0 "      Tesla K80"
>>>    GPU1 "      Tesla K80"
>>>
>>> Two way peer access between:
>>>    GPU0 and GPU1: YES
>>>
>>> If it's of any use, here is the output of nvidia-smi -q:
>>>
>>> ==============NVSMI LOG==============
>>>
>>> Timestamp                           : Thu Aug 11 22:42:34 2016
>>> Driver Version                      : 352.93
>>>
>>> Attached GPUs                       : 2
>>> GPU 0000:05:00.0
>>>     Product Name                    : Tesla K80
>>>     Product Brand                   : Tesla
>>>     Display Mode                    : Disabled
>>>     Display Active                  : Disabled
>>>     Persistence Mode                : Disabled
>>>     Accounting Mode                 : Disabled
>>>     Accounting Mode Buffer Size     : 1920
>>>     Driver Model
>>>         Current                     : N/A
>>>         Pending                     : N/A
>>>     Serial Number                   : 0325015055313
>>>     GPU UUID                        : GPU-a65eaa77-8871-ded5-b6ee-5268404192f1
>>>     Minor Number                    : 0
>>>     VBIOS Version                   : 80.21.1B.00.01
>>>     MultiGPU Board                  : Yes
>>>     Board ID                        : 0x300
>>>     Inforom Version
>>>         Image Version               : 2080.0200.00.04
>>>         OEM Object                  : 1.1
>>>         ECC Object                  : 3.0
>>>         Power Management Object     : N/A
>>>     GPU Operation Mode
>>>         Current                     : N/A
>>>         Pending                     : N/A
>>>     PCI
>>>         Bus                         : 0x05
>>>         Device                      : 0x00
>>>         Domain                      : 0x0000
>>>         Device Id                   : 0x102D10DE
>>>         Bus Id                      : 0000:05:00.0
>>>         Sub System Id               : 0x106C10DE
>>>         GPU Link Info
>>>             PCIe Generation
>>>                 Max                 : 3
>>>                 Current             : 3
>>>             Link Width
>>>                 Max                 : 16x
>>>                 Current             : 16x
>>>         Bridge Chip
>>>             Type                    : PLX
>>>             Firmware                : 0xF0472900
>>>         Replays since reset         : 0
>>>         Tx Throughput               : N/A
>>>         Rx Throughput               : N/A
>>>     Fan Speed                       : N/A
>>>     Performance State               : P0
>>>     Clocks Throttle Reasons
>>>         Idle                        : Not Active
>>>         Applications Clocks Setting : Active
>>>         SW Power Cap                : Not Active
>>>         HW Slowdown                 : Not Active
>>>         Unknown                     : Not Active
>>>     FB Memory Usage
>>>         Total                       : 12287 MiB
>>>         Used                        : 56 MiB
>>>         Free                        : 12231 MiB
>>>     BAR1 Memory Usage
>>>         Total                       : 16384 MiB
>>>         Used                        : 2 MiB
>>>         Free                        : 16382 MiB
>>>     Compute Mode                    : Default
>>>     Utilization
>>>         Gpu                         : 0 %
>>>         Memory                      : 0 %
>>>         Encoder                     : 0 %
>>>         Decoder                     : 0 %
>>>     Ecc Mode
>>>         Current                     : Disabled
>>>         Pending                     : Disabled
>>>     ECC Errors
>>>         Volatile
>>>             Single Bit
>>>                 Device Memory       : N/A
>>>                 Register File       : N/A
>>>                 L1 Cache            : N/A
>>>                 L2 Cache            : N/A
>>>                 Texture Memory      : N/A
>>>                 Total               : N/A
>>>             Double Bit
>>>                 Device Memory       : N/A
>>>                 Register File       : N/A
>>>                 L1 Cache            : N/A
>>>                 L2 Cache            : N/A
>>>                 Texture Memory      : N/A
>>>                 Total               : N/A
>>>         Aggregate
>>>             Single Bit
>>>                 Device Memory       : N/A
>>>                 Register File       : N/A
>>>                 L1 Cache            : N/A
>>>                 L2 Cache            : N/A
>>>                 Texture Memory      : N/A
>>>                 Total               : N/A
>>>             Double Bit
>>>                 Device Memory       : N/A
>>>                 Register File       : N/A
>>>                 L1 Cache            : N/A
>>>                 L2 Cache            : N/A
>>>                 Texture Memory      : N/A
>>>                 Total               : N/A
>>>     Retired Pages
>>>         Single Bit ECC              : 0
>>>         Double Bit ECC              : 0
>>>         Pending                     : No
>>>     Temperature
>>>         GPU Current Temp            : 31 C
>>>         GPU Shutdown Temp           : 93 C
>>>         GPU Slowdown Temp           : 88 C
>>>     Power Readings
>>>         Power Management            : Supported
>>>         Power Draw                  : 59.20 W
>>>         Power Limit                 : 149.00 W
>>>         Default Power Limit         : 149.00 W
>>>         Enforced Power Limit        : 149.00 W
>>>         Min Power Limit             : 100.00 W
>>>         Max Power Limit             : 175.00 W
>>>     Clocks
>>>         Graphics                    : 562 MHz
>>>         SM                          : 562 MHz
>>>         Memory                      : 2505 MHz
>>>     Applications Clocks
>>>         Graphics                    : 562 MHz
>>>         Memory                      : 2505 MHz
>>>     Default Applications Clocks
>>>         Graphics                    : 562 MHz
>>>         Memory                      : 2505 MHz
>>>     Max Clocks
>>>         Graphics                    : 875 MHz
>>>         SM                          : 875 MHz
>>>         Memory                      : 2505 MHz
>>>     Clock Policy
>>>         Auto Boost                  : On
>>>         Auto Boost Default          : On
>>>     Processes                       : None
>>>
>>> GPU 0000:06:00.0
>>>     Product Name                    : Tesla K80
>>>     Product Brand                   : Tesla
>>>     Display Mode                    : Disabled
>>>     Display Active                  : Disabled
>>>     Persistence Mode                : Disabled
>>>     Accounting Mode                 : Disabled
>>>     Accounting Mode Buffer Size     : 1920
>>>     Driver Model
>>>         Current                     : N/A
>>>         Pending                     : N/A
>>>     Serial Number                   : 0325015055313
>>>     GPU UUID                        : GPU-21c2be1c-72a9-1b68-adab-459d05dd7adc
>>>     Minor Number                    : 1
>>>     VBIOS Version                   : 80.21.1B.00.02
>>>     MultiGPU Board                  : Yes
>>>     Board ID                        : 0x300
>>>     Inforom Version
>>>         Image Version               : 2080.0200.00.04
>>>         OEM Object                  : 1.1
>>>         ECC Object                  : 3.0
>>>         Power Management Object     : N/A
>>>     GPU Operation Mode
>>>         Current                     : N/A
>>>         Pending                     : N/A
>>>     PCI
>>>         Bus                         : 0x06
>>>         Device                      : 0x00
>>>         Domain                      : 0x0000
>>>         Device Id                   : 0x102D10DE
>>>         Bus Id                      : 0000:06:00.0
>>>         Sub System Id               : 0x106C10DE
>>>         GPU Link Info
>>>             PCIe Generation
>>>                 Max                 : 3
>>>                 Current             : 3
>>>             Link Width
>>>                 Max                 : 16x
>>>                 Current             : 16x
>>>         Bridge Chip
>>>             Type                    : PLX
>>>             Firmware                : 0xF0472900
>>>         Replays since reset         : 0
>>>         Tx Throughput               : N/A
>>>         Rx Throughput               : N/A
>>>     Fan Speed                       : N/A
>>>     Performance State               : P0
>>>     Clocks Throttle Reasons
>>>         Idle                        : Not Active
>>>         Applications Clocks Setting : Active
>>>         SW Power Cap                : Not Active
>>>         HW Slowdown                 : Not Active
>>>         Unknown                     : Not Active
>>>     FB Memory Usage
>>>         Total                       : 12287 MiB
>>>         Used                        : 56 MiB
>>>         Free                        : 12231 MiB
>>>     BAR1 Memory Usage
>>>         Total                       : 16384 MiB
>>>         Used                        : 2 MiB
>>>         Free                        : 16382 MiB
>>>     Compute Mode                    : Default
>>>     Utilization
>>>         Gpu                         : 0 %
>>>         Memory                      : 0 %
>>>         Encoder                     : 0 %
>>>         Decoder                     : 0 %
>>>     Ecc Mode
>>>         Current                     : Disabled
>>>         Pending                     : Disabled
>>>     ECC Errors
>>>         Volatile
>>>             Single Bit
>>>                 Device Memory       : N/A
>>>                 Register File       : N/A
>>>                 L1 Cache            : N/A
>>>                 L2 Cache            : N/A
>>>                 Texture Memory      : N/A
>>>                 Total               : N/A
>>>             Double Bit
>>>                 Device Memory       : N/A
>>>                 Register File       : N/A
>>>                 L1 Cache            : N/A
>>>                 L2 Cache            : N/A
>>>                 Texture Memory      : N/A
>>>                 Total               : N/A
>>>         Aggregate
>>>             Single Bit
>>>                 Device Memory       : N/A
>>>                 Register File       : N/A
>>>                 L1 Cache            : N/A
>>>                 L2 Cache            : N/A
>>>                 Texture Memory      : N/A
>>>                 Total               : N/A
>>>             Double Bit
>>>                 Device Memory       : N/A
>>>                 Register File       : N/A
>>>                 L1 Cache            : N/A
>>>                 L2 Cache            : N/A
>>>                 Texture Memory      : N/A
>>>                 Total               : N/A
>>>     Retired Pages
>>>         Single Bit ECC              : 0
>>>         Double Bit ECC              : 0
>>>         Pending                     : No
>>>     Temperature
>>>         GPU Current Temp            : 24 C
>>>         GPU Shutdown Temp           : 93 C
>>>         GPU Slowdown Temp           : 88 C
>>>     Power Readings
>>>         Power Management            : Supported
>>>         Power Draw                  : 70.89 W
>>>         Power Limit                 : 149.00 W
>>>         Default Power Limit         : 149.00 W
>>>         Enforced Power Limit        : 149.00 W
>>>         Min Power Limit             : 100.00 W
>>>         Max Power Limit             : 175.00 W
>>>     Clocks
>>>         Graphics                    : 562 MHz
>>>         SM                          : 562 MHz
>>>         Memory                      : 2505 MHz
>>>     Applications Clocks
>>>         Graphics                    : 562 MHz
>>>         Memory                      : 2505 MHz
>>>     Default Applications Clocks
>>>         Graphics                    : 562 MHz
>>>         Memory                      : 2505 MHz
>>>     Max Clocks
>>>         Graphics                    : 875 MHz
>>>         SM                          : 875 MHz
>>>         Memory                      : 2505 MHz
>>>     Clock Policy
>>>         Auto Boost                  : On
>>>         Auto Boost Default          : On
>>>     Processes                       : None
>>>
>>>
>>> If it matters, when I do the tests with DO_PARALLEL="mpirun -np 4", I
>>> see that each process is running a thread on both GPUs. For example:
>>>
>>> # gpu     pid  type    sm   mem   enc   dec   command
>>> # Idx       #   C/G     %     %     %     %   name
>>>     0   30599     C    24     0     0     0   pmemd.cuda_DPFP
>>>     0   30600     C     0     0     0     0   pmemd.cuda_DPFP
>>>     0   30601     C    11     0     0     0   pmemd.cuda_DPFP
>>>     0   30602     C     0     0     0     0   pmemd.cuda_DPFP
>>>     1   30599     C     0     0     0     0   pmemd.cuda_DPFP
>>>     1   30600     C    36     0     0     0   pmemd.cuda_DPFP
>>>     1   30601     C     0     0     0     0   pmemd.cuda_DPFP
>>>     1   30602     C     6     0     0     0   pmemd.cuda_DPFP
>>>
>>> Is that expected behavior?
>>>
>>> Has anybody else had any problems using K80s with MPI and CUDA? Or using
>>> CentOS/RHEL 6?
>>>
>>> This machine does have dual CPUs; could that be a factor?
>>>
>>> I'm currently using AmberTools version 16.12 and Amber version 16.05.
>>>
>>> Any insight would be greatly appreciated.
>>>
>>> Thanks,
>>>
>>> Steve
>>>
>>>
>>>
>>> On Mon, Jul 25, 2016 at 3:06 PM, Steven Ford <sford123.ibbr.umd.edu>
>>> wrote:
>>>
>>>> Ross,
>>>>
>>>> This is CentOS version 6.7 with kernel version
>>>> 2.6.32-573.22.1.el6.x86_64.
>>>>
>>>> The output of nvidia-smi is:
>>>>
>>>> +------------------------------------------------------+
>>>> | NVIDIA-SMI 352.79     Driver Version: 352.79         |
>>>> |-------------------------------+----------------------+----------------------+
>>>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>>>> |===============================+======================+======================|
>>>> |   0  Tesla K80           Off  | 0000:05:00.0     Off |                  Off |
>>>> | N/A   34C    P0    59W / 149W |     56MiB / 12287MiB |      0%      Default |
>>>> +-------------------------------+----------------------+----------------------+
>>>> |   1  Tesla K80           Off  | 0000:06:00.0     Off |                  Off |
>>>> | N/A   27C    P0    48W / 149W |     56MiB / 12287MiB |      0%      Default |
>>>> +-------------------------------+----------------------+----------------------+
>>>>
>>>> +-----------------------------------------------------------------------------+
>>>> | Processes:                                                       GPU Memory |
>>>> |  GPU       PID  Type  Process name                               Usage      |
>>>> |=============================================================================|
>>>> |  No running processes found                                                 |
>>>> +-----------------------------------------------------------------------------+
>>>>
>>>> The version of nvcc:
>>>>
>>>> nvcc: NVIDIA (R) Cuda compiler driver
>>>> Copyright (c) 2005-2015 NVIDIA Corporation
>>>> Built on Tue_Aug_11_14:27:32_CDT_2015
>>>> Cuda compilation tools, release 7.5, V7.5.17
>>>>
>>>> I used the GNU compilers, version 4.4.7.
>>>>
>>>> I am using OpenMPI version 1.8.1-5.el6 from the CentOS repository. I
>>>> have not tried any other MPI installation.
>>>>
>>>> Output of mpif90 --showme:
>>>>
>>>> gfortran -I/usr/include/openmpi-x86_64 -pthread
>>>> -I/usr/lib64/openmpi/lib -Wl,-rpath -Wl,/usr/lib64/openmpi/lib
>>>> -Wl,--enable-new-dtags -L/usr/lib64/openmpi/lib -lmpi_usempi -lmpi_mpifh
>>>> -lmpi
>>>>
>>>>
>>>> I set DO_PARALLEL to "mpirun -np 2"
>>>>
>>>> The parallel tests for the CPU were all successful.
>>>>
>>>> I had not run 'make clean' in between each step. I tried the tests
>>>> again this morning after running 'make clean' and got the same result.
>>>>
>>>> I applied all patches this morning before testing again. I am using
>>>> AmberTools 16.10 and Amber 16.04
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Steve
>>>>
>>>> On Sat, Jul 23, 2016 at 6:32 PM, Ross Walker <ross.rosswalker.co.uk>
>>>> wrote:
>>>>
>>>>> Hi Steven,
>>>>>
>>>>> This is a large number of very worrying failures. Something is
>>>>> definitely very wrong here and I'd like to investigate further. Can you
>>>>> give me some more details about your system, please? This includes:
>>>>>
>>>>> The specifics of what version of Linux you are using.
>>>>>
>>>>> The output of nvidia-smi
>>>>>
>>>>> nvcc -V   (might be lower case v to get version info).
>>>>>
>>>>> Did you use the GNU compilers or the Intel compilers and in either
>>>>> case which version?
>>>>>
>>>>> OpenMPI - can you confirm the version again and also send me the
>>>>> output of mpif90 --showme (it might be --show or -show or something
>>>>> similar) - essentially I want to see what the underlying compilation line
>>>>> is.
>>>>>
>>>>> Can you confirm what you had $DO_PARALLEL set to when you ran make
>>>>> test for the parallel GPU build? Also, can you confirm whether the regular
>>>>> (CPU) parallel build passed the tests, please?
>>>>>
>>>>> Also did you run 'make clean' before each build step? E.g.
>>>>>
>>>>> ./configure -cuda gnu
>>>>> make -j8 install
>>>>> make test
>>>>> *make clean*
>>>>>
>>>>> ./configure -cuda -mpi gnu
>>>>> make -j8 install
>>>>> make test
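>>>>>
>>>>> (For the parallel builds, a minimal sketch of the test invocation; the
>>>>> process count is just an example and should match what you normally use.)
>>>>>
>>>>> export DO_PARALLEL="mpirun -np 2"
>>>>> make test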
>>>>>
>>>>> Have you tried any other MPI installations? - E.g. MPICH?
>>>>>
>>>>> And finally can you please confirm which version of Amber (and
>>>>> AmberTools) this is and which patches have been applied?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> All the best
>>>>> Ross
>>>>>
>>>>> On Jul 21, 2016, at 14:20, Steven Ford <sford123.ibbr.umd.edu> wrote:
>>>>>
>>>>> Ross,
>>>>>
>>>>> Attached are the log and diff files. Thank you for taking a look.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Steve
>>>>>
>>>>> On Thu, Jul 21, 2016 at 5:34 AM, Ross Walker <ross.rosswalker.co.uk>
>>>>> wrote:
>>>>>
>>>>>> Hi Steve,
>>>>>>
>>>>>> Indeed, that is too big a difference to be just rounding error. If
>>>>>> those tests are using Langevin or Andersen for the thermostat, that
>>>>>> would explain it (different random number streams), although those
>>>>>> tests are supposed to be skipped in parallel.
>>>>>>
>>>>>> Can you send me a copy directly of your .log and .dif files for the 2
>>>>>> GPU run and I'll take a closer look at it.
>>>>>>
>>>>>> All the best
>>>>>> Ross
>>>>>>
>>>>>> > On Jul 20, 2016, at 21:19, Steven Ford <sford123.ibbr.umd.edu>
>>>>>> wrote:
>>>>>> >
>>>>>> > Hello All,
>>>>>> >
>>>>>> > I am currently trying to get Amber16 installed and running on our
>>>>>> > computing cluster. Our researchers are primarily interested in running
>>>>>> > the GPU accelerated programs. For GPU computing jobs, we have one
>>>>>> > CentOS 6.7 node with a Tesla K80.
>>>>>> >
>>>>>> > I was able to build Amber16 and run the Serial/Parallel CPU plus the
>>>>>> > Serial GPU tests with all file comparisons passing. However, only 5
>>>>>> > parallel GPU tests succeeded, while the other 100 comparisons failed.
>>>>>> >
>>>>>> > Examining the diff file shows that some of the numbers are not off by
>>>>>> > much, like the documentation said could happen. For example:
>>>>>> >
>>>>>> > 66c66
>>>>>> > <  NSTEP =        1   TIME(PS) =      50.002  TEMP(K) =   351.27  PRESS =     0.
>>>>>> > >  NSTEP =        1   TIME(PS) =      50.002  TEMP(K) =   353.29  PRESS =     0.
>>>>>> >
>>>>>> > This may also be too large to attribute to a rounding error, but it
>>>>>> > is a small difference compared to others:
>>>>>> >
>>>>>> > 85c85
>>>>>> > <  Etot   =      -217.1552  EKtot   =       238.6655  EPtot      =      -455.8207
>>>>>> > >  Etot   =     -1014.2562  EKtot   =       244.6242  EPtot      =     -1258.8804
>>>>>> >
>>>>>> > This was built with CUDA 7.5 and OpenMPI 1.8, and run with
>>>>>> > DO_PARALLEL="mpirun -np 2"
>>>>>> >
>>>>>> > Any idea what else could be affecting the output?
>>>>>> >
>>>>>> > Thanks,
>>>>>> >
>>>>>> > Steve
>>>>>> >
>>>>>> > --
>>>>>> > Steven Ford
>>>>>> > IT Infrastructure Specialist
>>>>>> > Institute for Bioscience and Biotechnology Research
>>>>>> > University of Maryland
>>>>>> > (240)314-6405
>>>>>> > _______________________________________________
>>>>>> > AMBER mailing list
>>>>>> > AMBER.ambermd.org
>>>>>> > http://lists.ambermd.org/mailman/listinfo/amber
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Steven Ford
>>>>> IT Infrastructure Specialist
>>>>> Institute for Bioscience and Biotechnology Research
>>>>> University of Maryland
>>>>> (240)314-6405
>>>>> <2016-07-20_11-17-52.diff><2016-07-20_11-17-52.log>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Steven Ford
>>>> IT Infrastructure Specialist
>>>> Institute for Bioscience and Biotechnology Research
>>>> University of Maryland
>>>> (240)314-6405
>>>>
>>>
>>>
>>>
>>> --
>>> Steven Ford
>>> IT Infrastructure Specialist
>>> Institute for Bioscience and Biotechnology Research
>>> University of Maryland
>>> (240)314-6405
>>>
>>>
>>>
>>
>>
>> --
>> Steven Ford
>> IT Infrastructure Specialist
>> Institute for Bioscience and Biotechnology Research
>> University of Maryland
>> (240)314-6405
>>
>>
>>
-- 
Steven Ford
IT Infrastructure Specialist
Institute for Bioscience and Biotechnology Research
University of Maryland
(240)314-6405
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber