Re: [AMBER] gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered

From: Ross Walker <ross.rosswalker.co.uk>
Date: Sat, 13 Feb 2016 10:13:19 -0800

Hi Dan and Sarah,

MPICH would be the simple choice - the GPU code doesn't need any fancy MPI technology, so the simpler the better. That said, given how simple the MPI usage is, I doubt this is the problem. Thinking it through some more, it is very unlikely to be the MPI. My suspicion is misconfigured hardware that prevents the PCI-E bus from handling peer-to-peer communication properly. It reports that it is doing P2P but is likely just transferring garbage. The NVIDIA simpleP2P test should give the definitive answer here.
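
For reference, a rough sketch of how to build and run that test from the CUDA samples (the paths assume a default CUDA 7.5 toolkit install and may differ on your system):

   # Copy the bundled samples into your home directory (the install script ships with the toolkit)
   cuda-install-samples-7.5.sh ~
   # Build and run the peer-to-peer sample on the two devices in question
   cd ~/NVIDIA_CUDA-7.5_Samples/0_Simple/simpleP2P
   make
   CUDA_VISIBLE_DEVICES=0,1 ./simpleP2P

If the data-verification step in simpleP2P fails even though peer access is reported as enabled, that points squarely at the PCI-E / ACS configuration rather than at AMBER or the MPI.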

Out of interest, what is the hardware - i.e. who makes the nodes / motherboards, etc.?

Also, do you know what the OS is here - and can you have someone confirm they aren't running some kind of virtual machine or other software that abstracts things from the bare-metal hardware?
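
For what it's worth, a couple of quick checks (standard Linux tools - adjust if the nodes run something more exotic) that usually reveal the OS and whether a hypervisor is in the picture:

   # OS and kernel version (/etc/os-release exists on most modern distros)
   cat /etc/os-release
   uname -r
   # Most hypervisors advertise themselves to the guest
   lscpu | grep -i hypervisor
   dmesg | grep -i hypervisor

If the hypervisor checks come back empty, the nodes are almost certainly bare metal.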

All the best
Ross

> On Feb 13, 2016, at 08:18, Daniel Roe <daniel.r.roe.gmail.com> wrote:
>
> On Fri, Feb 12, 2016 at 5:58 PM, Sarah Anderson <saraha.cray.com> wrote:
>> Tried
>> openmpi/1.8.4_gcc 2) cudatoolkit/7.5.18 3) gcc/4.9.1
>
> Another thing you may want to try (if you haven't already) is MPICH or
> MVAPICH. I haven't had a lot of success with OpenMPI and Amber over
> the years, but MPICH/MVAPICH usually works great for me.
>
> -Dan
>
>>
>> The single-GPU run from the configure -mpi -cuda -noX11 gnu build
>> worked fine, but two GPUs produced the same error.
>>> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
>>> srun: error: prd2-0171: task 0: Exited with exit code 255
>> I tried an MPI-only GNU build, which worked fine with 16 or 32 ranks.
>>
>> I dropped back to cudatoolkit/7.0.28
>> Same message. I'm out of ideas for now. When I have the output of
>>> lspci -d "10b5:*" -vvv | grep ACSCtl
>> I'll send it to you.
>>
>> Sarah
>>
>>
>> On 2/12/16 4:36 PM, Ross Walker wrote:
>>> Hi Sarah,
>>>
>>> I suspect this is indeed an MPI issue - likely related to it screwing up how P2P communication works - but it could also be a BIOS issue or a compiler issue, again related to peer-to-peer communication. I've seen race conditions with a few of the Intel compiler versions but have never been able to track them down, and concluded it was a compiler bug. Can you check a couple of things, please:
>>>
>>> 1) Make sure all the latest updates are applied.
>>>
>>> 2) Build with the GNU compilers and see if the problem is still there.
>>>
>>> Also, the output from the following command would be useful in determining whether or not it is a BIOS bug on the hardware:
>>>
>>> lspci -d "10b5:*" -vvv | grep ACSCtl
>>>
>>> (But it unfortunately needs to be run as root).
>>>
>>> All the best
>>> Ross
>>>
>>>> On Feb 12, 2016, at 14:24, Sarah Anderson<saraha.cray.com> wrote:
>>>>
>>>> Hi Ross,
>>>>
>>>> I was running this test:
>>>>
>>>> Amber14_Benchmark_Suite/PME/JAC_production_NVE
>>>>
>>>> srun -n 2 --mpi=pmi2 --cpu_bind=socket $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin.GPU -o mdout.2GPU -p prmtop -c inpcrd
>>>>
>>>> Yes, a single GPU run completes normally.
>>>>
>>>> I tried another test in that suite, and though the 2-GPU test completed, it came up with different (nonsense) answers.
>>>> Maybe there's something wrong with the MPI/build. It's just Intel impi & "configure -cuda -mpi -noX11 intel".
>>>>
>>>> Thanks,
>>>> Sarah
>>>>
>>>> Two GPUs
>>>>
>>>> R M S F L U C T U A T I O N S
>>>>
>>>>
>>>> NSTEP = 1000 TIME(PS) = 102.000 TEMP(K) = 6.13 PRESS = 0.0
>>>> Etot = 10445.2658 EKtot = 388.9169 EPtot = 10213.0555
>>>> BOND = 138.3458 ANGLE = 184.8487 DIHED = 119.1381
>>>> 1-4 NB = 35.9507 1-4 EEL = 842.2853 VDWAALS = 9138.2936
>>>> EELEC = 2970.4534 EGB = 3580.7687 RESTRAINT = 0.0000
>>>> ------------------------------------------------------------------------------
>>>>
>>>>
>>>>
>>>> Single GPU:
>>>>
>>>> R M S F L U C T U A T I O N S
>>>>
>>>>
>>>> NSTEP = 1000 TIME(PS) = 102.000 TEMP(K) = 0.91 PRESS = 0.0
>>>> Etot = 22.9288 EKtot = 57.4768 EPtot = 49.0089
>>>> BOND = 32.3643 ANGLE = 82.1033 DIHED = 80.2520
>>>> 1-4 NB = 14.6668 1-4 EEL = 107.9454 VDWAALS = 148.9948
>>>> EELEC = 280.7436 EGB = 212.3126 RESTRAINT = 0.0000
>>>> ------------------------------------------------------------------------------
>>>>
>>>>
>>>>
>>>>
>>>> On 2/12/16 1:16 PM, Ross Walker wrote:
>>>>> Hi Sarah,
>>>>>
>>>>> The error message doesn't tell us a lot here, unfortunately. The issue with GPU error messages is that if you have an array in memory that contains a NaN - say the force array picked up an infinite force - then things will be fine 'until' the code tries to upload to or download from the GPU (or do a similar copy operation) - NaNs are not supported in these operations, and thus you get the error that you see. The real error - e.g. 2 atoms sitting on top of each other - occurred somewhere else entirely within the code. The net result is that just because the error is in gpu_allreduce does not mean it is related to something wrong with the GPUs, a driver issue, or even a multi-GPU issue.
>>>>>
>>>>> What it likely means is that there is something wrong with the simulation itself that you are running. Have you tried running on just 1 GPU to see if that crashes as well (and also on the CPU)? Can you provide some more details about what you are actually simulating?
>>>>>
>>>>> All the best
>>>>> Ross
>>>>>
>>>>>> On Feb 12, 2016, at 10:42, Sarah Anderson<saraha.cray.com> wrote:
>>>>>>
>>>>>> Has anyone seen this message lately? I saw some notes about it in mid-2015, but no particular fix was suggested.
>>>>>>
>>>>>> This is with CUDA 7.5 using a pair of K80 GPUs in peer-to-peer mode.
>>>>>>
>>>>>> It fails with all combinations of CUDA_VISIBLE_DEVICES (0,1 and 2,3).
>>>>>>
>>>>>>> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
>>>>>> |------------------- GPU DEVICE INFO --------------------
>>>>>> |
>>>>>> | Task ID: 0
>>>>>> | CUDA_VISIBLE_DEVICES: 0,1
>>>>>> | CUDA Capable Devices Detected: 2
>>>>>> | CUDA Device ID in use: 0
>>>>>> | CUDA Device Name: Tesla K80
>>>>>> | CUDA Device Global Mem Size: 11519 MB
>>>>>> | CUDA Device Num Multiprocessors: 13
>>>>>> | CUDA Device Core Freq: 0.82 GHz
>>>>>> |
>>>>>> |
>>>>>> | Task ID: 1
>>>>>> | CUDA_VISIBLE_DEVICES: 0,1
>>>>>> | CUDA Capable Devices Detected: 2
>>>>>> | CUDA Device ID in use: 1
>>>>>> | CUDA Device Name: Tesla K80
>>>>>> | CUDA Device Global Mem Size: 11519 MB
>>>>>> | CUDA Device Num Multiprocessors: 13
>>>>>> | CUDA Device Core Freq: 0.82 GHz
>>>>>> |
>>>>>> |--------------------------------------------------------
>>>>>>
>>>>>> |---------------- GPU PEER TO PEER INFO -----------------
>>>>>> |
>>>>>> | Peer to Peer support: ENABLED
>>>>>> |
>>>>>> |--------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> Here is deviceQuery
>>>>>>
>>>>>> CUDA Device Query (Runtime API) version (CUDART static linking)
>>>>>>
>>>>>> Detected 2 CUDA Capable device(s)
>>>>>>
>>>>>> Device 0: "Tesla K80"
>>>>>> CUDA Driver Version / Runtime Version 7.5 / 7.5
>>>>>> CUDA Capability Major/Minor version number: 3.7
>>>>>> Total amount of global memory: 11520 MBytes (12079136768 bytes)
>>>>>> MapSMtoCores for SM 3.7 is undefined. Default to use 192 Cores/SM
>>>>>> MapSMtoCores for SM 3.7 is undefined. Default to use 192 Cores/SM
>>>>>> (13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
>>>>>> GPU Clock rate: 824 MHz (0.82 GHz)
>>>>>> Memory Clock rate: 2505 Mhz
>>>>>> Memory Bus Width: 384-bit
>>>>>> L2 Cache Size: 1572864 bytes
>>>>>> Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
>>>>>> Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
>>>>>> Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
>>>>>> Total amount of constant memory: 65536 bytes
>>>>>> Total amount of shared memory per block: 49152 bytes
>>>>>> Total number of registers available per block: 65536
>>>>>> Warp size: 32
>>>>>> Maximum number of threads per multiprocessor: 2048
>>>>>> Maximum number of threads per block: 1024
>>>>>> Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
>>>>>> Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
>>>>>> Maximum memory pitch: 2147483647 bytes
>>>>>> Texture alignment: 512 bytes
>>>>>> Concurrent copy and kernel execution: Yes with 2 copy engine(s)
>>>>>> Run time limit on kernels: No
>>>>>> Integrated GPU sharing Host Memory: No
>>>>>> Support host page-locked memory mapping: Yes
>>>>>> Alignment requirement for Surfaces: Yes
>>>>>> Device has ECC support: Enabled
>>>>>> Device supports Unified Addressing (UVA): Yes
>>>>>> Device PCI Bus ID / PCI location ID: 5 / 0
>>>>>> Compute Mode:
>>>>>> < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
>>>>>>
>>>>>> Device 1: "Tesla K80"
>>>>>> CUDA Driver Version / Runtime Version 7.5 / 7.5
>>>>>> CUDA Capability Major/Minor version number: 3.7
>>>>>> Total amount of global memory: 11520 MBytes (12079136768 bytes)
>>>>>> MapSMtoCores for SM 3.7 is undefined. Default to use 192 Cores/SM
>>>>>> MapSMtoCores for SM 3.7 is undefined. Default to use 192 Cores/SM
>>>>>> (13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
>>>>>> GPU Clock rate: 824 MHz (0.82 GHz)
>>>>>> Memory Clock rate: 2505 Mhz
>>>>>> Memory Bus Width: 384-bit
>>>>>> L2 Cache Size: 1572864 bytes
>>>>>> Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
>>>>>> Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
>>>>>> Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
>>>>>> Total amount of constant memory: 65536 bytes
>>>>>> Total amount of shared memory per block: 49152 bytes
>>>>>> Total number of registers available per block: 65536
>>>>>> Warp size: 32
>>>>>> Maximum number of threads per multiprocessor: 2048
>>>>>> Maximum number of threads per block: 1024
>>>>>> Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
>>>>>> Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
>>>>>> Maximum memory pitch: 2147483647 bytes
>>>>>> Texture alignment: 512 bytes
>>>>>> Concurrent copy and kernel execution: Yes with 2 copy engine(s)
>>>>>> Run time limit on kernels: No
>>>>>> Integrated GPU sharing Host Memory: No
>>>>>> Support host page-locked memory mapping: Yes
>>>>>> Alignment requirement for Surfaces: Yes
>>>>>> Device has ECC support: Enabled
>>>>>> Device supports Unified Addressing (UVA): Yes
>>>>>> Device PCI Bus ID / PCI location ID: 6 / 0
>>>>>> Compute Mode:
>>>>>> < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
>>>>>>> Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : Yes
>>>>>>> Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : Yes
>>>>>> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla K80, Device1 = Tesla K80
>>>>>> Result = PASS
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> AMBER mailing list
>>>>>> AMBER.ambermd.org
>>>>>> http://lists.ambermd.org/mailman/listinfo/amber
>>>>> _______________________________________________
>>>>> AMBER mailing list
>>>>> AMBER.ambermd.org
>>>>> http://lists.ambermd.org/mailman/listinfo/amber
>>>> _______________________________________________
>>>> AMBER mailing list
>>>> AMBER.ambermd.org
>>>> http://lists.ambermd.org/mailman/listinfo/amber
>>> _______________________________________________
>>> AMBER mailing list
>>> AMBER.ambermd.org
>>> http://lists.ambermd.org/mailman/listinfo/amber
>>
>>
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber
>
>
>
> --
> -------------------------
> Daniel R. Roe, PhD
> Department of Medicinal Chemistry
> University of Utah
> 30 South 2000 East, Room 307
> Salt Lake City, UT 84112-5820
> http://home.chpc.utah.edu/~cheatham/
> (801) 587-9652
> (801) 585-6208 (Fax)
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Sat Feb 13 2016 - 10:30:03 PST