Re: [AMBER] gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered

From: Ryan Novosielski <novosirj.ca.rutgers.edu>
Date: Tue, 16 Feb 2016 09:20:49 -0500

I would be interested to know what the setting was, and I imagine others would be as well.

Sent from my iPhone

> On Feb 15, 2016, at 12:04, Sarah Anderson <saraha.cray.com> wrote:
>
> Hi Ross,
>
> Thanks for the pointer to simpleP2P. That illustrated the problem well enough to identify and fix a PCI bridge setting that had somehow
> become unset. We have also updated to the current NVIDIA driver, version 352.79.
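>
> (The usual suspect for this symptom is ACS being enabled on the PLX (10b5) bridges between the GPUs, which redirects peer-to-peer traffic up through the root complex. As a sketch only - it is an assumption that ACS was the setting involved here, and the ECAP_ACS name needs a reasonably recent pciutils - it can be cleared, as root, with something like:
>
> # clear the ACS Control register on every PLX bridge so GPU P2P traffic is not redirected
> for bdf in $(lspci -d "10b5:*" | awk '{print $1}'); do
>     setpci -s "$bdf" ECAP_ACS+6.w=0000
> done
>
> and then re-verified with the lspci ACSCtl command quoted further down in this thread.)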
>
> All is well now, both with simpleP2P and amber.
>
> Sarah
>
>
>> On 2/12/16 7:03 PM, Ross Walker wrote:
>> Hi Sarah,
>>
>> This is most likely a hardware configuration issue. There are a lot of vendors out there shipping misconfigured BIOSes with Haswell systems right now. I am guessing these are Haswell (v3) CPUs, yes?
>>
>> As well as the output of the lspci command, can you try building simpleP2P from the NVIDIA CUDA Samples and running it with the various combinations of GPUs?
>>
>> E.g.
>> unset CUDA_VISIBLE_DEVICES
>> ./simpleP2P
>>
>> export CUDA_VISIBLE_DEVICES=0,1
>> ./simpleP2P
>>
>> export CUDA_VISIBLE_DEVICES=0,2
>> ./simpleP2P
>>
>> export CUDA_VISIBLE_DEVICES=0,3
>> ./simpleP2P
>>
>> export CUDA_VISIBLE_DEVICES=1,2
>> ./simpleP2P
>>
>> export CUDA_VISIBLE_DEVICES=1,3
>> ./simpleP2P
>>
>> export CUDA_VISIBLE_DEVICES=2,3
>> ./simpleP2P
>>
>> I suspect various combinations here will report that they can do P2P communication but will fail the test.
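>>
>> Equivalently, as a convenience sketch (assuming simpleP2P has been built in the current directory), the whole sweep can be scripted:
>>
>> # run simpleP2P for every pair of the four GPUs
>> for pair in 0,1 0,2 0,3 1,2 1,3 2,3; do
>>     echo "=== CUDA_VISIBLE_DEVICES=$pair ==="
>>     CUDA_VISIBLE_DEVICES=$pair ./simpleP2P
>> done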
>>
>> All the best
>> Ross
>>
>>> On Feb 12, 2016, at 16:58, Sarah Anderson <saraha.cray.com> wrote:
>>>
>>> Hi Ross,
>>>> Checking for updates...
>>>> Checking for available patches online. This may take a few seconds...
>>>>
>>>> Available AmberTools 15 patches:
>>>>
>>>> No patches available
>>>>
>>>> Available Amber 14 patches:
>>>>
>>>> No patches available
>>> Tried:
>>> 1) openmpi/1.8.4_gcc 2) cudatoolkit/7.5.18 3) gcc/4.9.1
>>>
>>> The single-GPU run from the GNU build (configure -mpi -cuda -noX11 gnu)
>>> worked fine, but two GPUs produced the same error.
>>>> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
>>>> srun: error: prd2-0171: task 0: Exited with exit code 255
>>> I tried an MPI-only GNU build, which worked fine with 16 or 32 ranks.
>>>
>>> I dropped back to cudatoolkit/7.0.28 and got the same message. I'm out of ideas for now. When I have the output of
>>>> lspci -d "10b5:*" -vvv | grep ACSCtl
>>> I'll send it to you.
>>>
>>> Sarah
>>>
>>>
>>>> On 2/12/16 4:36 PM, Ross Walker wrote:
>>>> Hi Sarah,
>>>>
>>>> I suspect this is indeed an MPI issue - likely related to it screwing up how P2P communication works - but it could also be a BIOS issue or a compiler issue, again related to peer-to-peer communication. I've seen race conditions with a few of the Intel compiler versions but have never been able to track them down beyond concluding it was a compiler bug. Can you check a couple of things, please:
>>>>
>>>> 1) Make sure all the latest updates are applied.
>>>>
>>>> 2) Build with the GNU compilers and see if the problem is still there.
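>>>>
>>>> A minimal sketch of both steps, assuming a standard Amber 14 / AmberTools 15 tree (the configure flags are the ones already used elsewhere in this thread; adjust as needed):
>>>>
>>>> cd $AMBERHOME
>>>> ./update_amber --update        # 1) apply any available AMBER/AmberTools patches
>>>> make clean
>>>> ./configure -cuda -mpi -noX11 gnu
>>>> make install                   # 2) rebuild pmemd.cuda.MPI with the GNU compilers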
>>>>
>>>> Also, the output from the following command would be useful in determining whether or not it is a BIOS bug on the hardware:
>>>>
>>>> lspci -d "10b5:*" -vvv | grep ACSCtl
>>>>
>>>> (But it unfortunately needs to be run as root).
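>>>>
>>>> For reference (this is not output from your system), each bridge prints a line of the form
>>>>
>>>> ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
>>>>
>>>> and any '+' flags mean ACS is enabled on that bridge, which is a common cause of exactly this kind of multi-GPU peer-to-peer failure.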
>>>>
>>>> All the best
>>>> Ross
>>>>
>>>>> On Feb 12, 2016, at 14:24, Sarah Anderson <saraha.cray.com> wrote:
>>>>>
>>>>> Hi Ross,
>>>>>
>>>>> I was running this test:
>>>>>
>>>>> Amber14_Benchmark_Suite/PME/JAC_production_NVE
>>>>>
>>>>> srun -n 2 --mpi=pmi2 --cpu_bind=socket $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin.GPU -o mdout.2GPU -p prmtop -c inpcrd
>>>>>
>>>>> Yes, a single GPU run completes normally.
>>>>>
>>>>> I tried another test in that suite, and though the 2-GPU test completed, it came up with different (nonsense) answers.
>>>>> Maybe there's something wrong with the MPI or the build. It's just Intel MPI (impi) & "configure -cuda -mpi -noX11 intel".
>>>>>
>>>>> Thanks,
>>>>> Sarah
>>>>>
>>>>> Two GPUs
>>>>>
>>>>> R M S F L U C T U A T I O N S
>>>>>
>>>>>
>>>>> NSTEP = 1000 TIME(PS) = 102.000 TEMP(K) = 6.13 PRESS = 0.0
>>>>> Etot = 10445.2658 EKtot = 388.9169 EPtot = 10213.0555
>>>>> BOND = 138.3458 ANGLE = 184.8487 DIHED = 119.1381
>>>>> 1-4 NB = 35.9507 1-4 EEL = 842.2853 VDWAALS = 9138.2936
>>>>> EELEC = 2970.4534 EGB = 3580.7687 RESTRAINT = 0.0000
>>>>> ------------------------------------------------------------------------------
>>>>>
>>>>>
>>>>>
>>>>> Single GPU:
>>>>>
>>>>> R M S F L U C T U A T I O N S
>>>>>
>>>>>
>>>>> NSTEP = 1000 TIME(PS) = 102.000 TEMP(K) = 0.91 PRESS = 0.0
>>>>> Etot = 22.9288 EKtot = 57.4768 EPtot = 49.0089
>>>>> BOND = 32.3643 ANGLE = 82.1033 DIHED = 80.2520
>>>>> 1-4 NB = 14.6668 1-4 EEL = 107.9454 VDWAALS = 148.9948
>>>>> EELEC = 280.7436 EGB = 212.3126 RESTRAINT = 0.0000
>>>>> ------------------------------------------------------------------------------
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On 2/12/16 1:16 PM, Ross Walker wrote:
>>>>>> Hi Sarah,
>>>>>>
>>>>>> The error message doesn't tell us a lot here, unfortunately. The issue with GPU error messages is that if you have an array in memory that contains a NaN - say the force array picked up an infinite force - then things will be fine 'until' the code tries to upload or download from the GPU (or do a similar copy operation). NaNs are not supported in these operations, and thus you get the error that you see. The real error - e.g. two atoms sitting on top of each other - occurred somewhere else entirely within the code. The net result is that just because the error is in gpu_allreduce does not mean it is related to something wrong with the GPUs, a driver issue, or even a multi-GPU issue.
>>>>>>
>>>>>> What it likely means is that there is something wrong with the simulation itself that you are running. Have you tried running on just 1 GPU to see if that crashes as well (and also on the CPU)? Can you provide some more details about what you are actually simulating?
>>>>>>
>>>>>> All the best
>>>>>> Ross
>>>>>>
>>>>>>> On Feb 12, 2016, at 10:42, Sarah Anderson <saraha.cray.com> wrote:
>>>>>>>
>>>>>>> Has anyone seen this message lately? I saw some notes about it in mid-2015, but no particular fix was suggested.
>>>>>>>
>>>>>>> This is with CUDA 7.5 using a pair of K80 GPUs in peer-to-peer mode.
>>>>>>>
>>>>>>> It fails with all combinations of CUDA_VISIBLE_DEVICES (0,1 and 2,3).
>>>>>>>
>>>>>>>> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
>>>>>>> |------------------- GPU DEVICE INFO --------------------
>>>>>>> |
>>>>>>> | Task ID: 0
>>>>>>> | CUDA_VISIBLE_DEVICES: 0,1
>>>>>>> | CUDA Capable Devices Detected: 2
>>>>>>> | CUDA Device ID in use: 0
>>>>>>> | CUDA Device Name: Tesla K80
>>>>>>> | CUDA Device Global Mem Size: 11519 MB
>>>>>>> | CUDA Device Num Multiprocessors: 13
>>>>>>> | CUDA Device Core Freq: 0.82 GHz
>>>>>>> |
>>>>>>> |
>>>>>>> | Task ID: 1
>>>>>>> | CUDA_VISIBLE_DEVICES: 0,1
>>>>>>> | CUDA Capable Devices Detected: 2
>>>>>>> | CUDA Device ID in use: 1
>>>>>>> | CUDA Device Name: Tesla K80
>>>>>>> | CUDA Device Global Mem Size: 11519 MB
>>>>>>> | CUDA Device Num Multiprocessors: 13
>>>>>>> | CUDA Device Core Freq: 0.82 GHz
>>>>>>> |
>>>>>>> |--------------------------------------------------------
>>>>>>>
>>>>>>> |---------------- GPU PEER TO PEER INFO -----------------
>>>>>>> |
>>>>>>> | Peer to Peer support: ENABLED
>>>>>>> |
>>>>>>> |--------------------------------------------------------
>>>>>>>
>>>>>>>
>>>>>>> Here is deviceQuery
>>>>>>>
>>>>>>> CUDA Device Query (Runtime API) version (CUDART static linking)
>>>>>>>
>>>>>>> Detected 2 CUDA Capable device(s)
>>>>>>>
>>>>>>> Device 0: "Tesla K80"
>>>>>>> CUDA Driver Version / Runtime Version 7.5 / 7.5
>>>>>>> CUDA Capability Major/Minor version number: 3.7
>>>>>>> Total amount of global memory: 11520 MBytes (12079136768 bytes)
>>>>>>> MapSMtoCores for SM 3.7 is undefined. Default to use 192 Cores/SM
>>>>>>> MapSMtoCores for SM 3.7 is undefined. Default to use 192 Cores/SM
>>>>>>> (13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
>>>>>>> GPU Clock rate: 824 MHz (0.82 GHz)
>>>>>>> Memory Clock rate: 2505 Mhz
>>>>>>> Memory Bus Width: 384-bit
>>>>>>> L2 Cache Size: 1572864 bytes
>>>>>>> Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
>>>>>>> Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
>>>>>>> Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
>>>>>>> Total amount of constant memory: 65536 bytes
>>>>>>> Total amount of shared memory per block: 49152 bytes
>>>>>>> Total number of registers available per block: 65536
>>>>>>> Warp size: 32
>>>>>>> Maximum number of threads per multiprocessor: 2048
>>>>>>> Maximum number of threads per block: 1024
>>>>>>> Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
>>>>>>> Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
>>>>>>> Maximum memory pitch: 2147483647 bytes
>>>>>>> Texture alignment: 512 bytes
>>>>>>> Concurrent copy and kernel execution: Yes with 2 copy engine(s)
>>>>>>> Run time limit on kernels: No
>>>>>>> Integrated GPU sharing Host Memory: No
>>>>>>> Support host page-locked memory mapping: Yes
>>>>>>> Alignment requirement for Surfaces: Yes
>>>>>>> Device has ECC support: Enabled
>>>>>>> Device supports Unified Addressing (UVA): Yes
>>>>>>> Device PCI Bus ID / PCI location ID: 5 / 0
>>>>>>> Compute Mode:
>>>>>>> < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
>>>>>>>
>>>>>>> Device 1: "Tesla K80"
>>>>>>> CUDA Driver Version / Runtime Version 7.5 / 7.5
>>>>>>> CUDA Capability Major/Minor version number: 3.7
>>>>>>> Total amount of global memory: 11520 MBytes (12079136768 bytes)
>>>>>>> MapSMtoCores for SM 3.7 is undefined. Default to use 192 Cores/SM
>>>>>>> MapSMtoCores for SM 3.7 is undefined. Default to use 192 Cores/SM
>>>>>>> (13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
>>>>>>> GPU Clock rate: 824 MHz (0.82 GHz)
>>>>>>> Memory Clock rate: 2505 Mhz
>>>>>>> Memory Bus Width: 384-bit
>>>>>>> L2 Cache Size: 1572864 bytes
>>>>>>> Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
>>>>>>> Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
>>>>>>> Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
>>>>>>> Total amount of constant memory: 65536 bytes
>>>>>>> Total amount of shared memory per block: 49152 bytes
>>>>>>> Total number of registers available per block: 65536
>>>>>>> Warp size: 32
>>>>>>> Maximum number of threads per multiprocessor: 2048
>>>>>>> Maximum number of threads per block: 1024
>>>>>>> Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
>>>>>>> Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
>>>>>>> Maximum memory pitch: 2147483647 bytes
>>>>>>> Texture alignment: 512 bytes
>>>>>>> Concurrent copy and kernel execution: Yes with 2 copy engine(s)
>>>>>>> Run time limit on kernels: No
>>>>>>> Integrated GPU sharing Host Memory: No
>>>>>>> Support host page-locked memory mapping: Yes
>>>>>>> Alignment requirement for Surfaces: Yes
>>>>>>> Device has ECC support: Enabled
>>>>>>> Device supports Unified Addressing (UVA): Yes
>>>>>>> Device PCI Bus ID / PCI location ID: 6 / 0
>>>>>>> Compute Mode:
>>>>>>> < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
>>>>>>>> Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : Yes
>>>>>>>> Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : Yes
>>>>>>> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla K80, Device1 = Tesla K80
>>>>>>> Result = PASS
>>>>>>>
>>>>>>>
>>>>>>>

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Feb 16 2016 - 06:30:03 PST