Re: [AMBER] gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered

From: Scott Le Grand <varelse2005.gmail.com>
Date: Sat, 13 Feb 2016 08:41:04 -0800

The MPI API of choice tends to rotate over the years. Sigh...

That said, once I eliminated the MPI 2.0 calls from AMBER, they all got a
lot more stable.
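
As a rough illustration of what avoiding MPI 2.0 calls can look like in practice - each rank stages its GPU data through host memory and reduces it with a plain MPI-1 MPI_Allreduce - here is a hedged sketch. This is not AMBER's actual gpu_allreduce; the buffer name, size, and error macro are invented, and it assumes an MPI compiler wrapper with libcudart linked in.

    /* Hedged sketch, not AMBER's gpu_allreduce: reduce per-rank force data
     * with a plain MPI-1 collective, staged through host memory (no MPI-2
     * features, no CUDA-aware MPI). */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define CUDA_CHECK(call) do {                                           \
            cudaError_t e_ = (call);                                        \
            if (e_ != cudaSuccess) {                                        \
                fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(e_)); \
                MPI_Abort(MPI_COMM_WORLD, 1);                               \
            }                                                               \
        } while (0)

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 3 * 23558;   /* 3*natom; JAC/DHFR-sized, purely illustrative */
        double *d_frc;
        double *h_frc = (double *)malloc(n * sizeof(double));
        double *h_sum = (double *)malloc(n * sizeof(double));
        CUDA_CHECK(cudaMalloc(&d_frc, n * sizeof(double)));
        CUDA_CHECK(cudaMemset(d_frc, 0, n * sizeof(double)));

        /* ...force kernels for this rank's atoms would run here... */

        /* Surface any pending kernel error before trusting the data. */
        CUDA_CHECK(cudaDeviceSynchronize());

        /* Device -> host, MPI-1 allreduce, host -> device. */
        CUDA_CHECK(cudaMemcpy(h_frc, d_frc, n * sizeof(double), cudaMemcpyDeviceToHost));
        MPI_Allreduce(h_frc, h_sum, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        CUDA_CHECK(cudaMemcpy(d_frc, h_sum, n * sizeof(double), cudaMemcpyHostToDevice));

        if (rank == 0) printf("allreduce of %d doubles completed\n", n);

        CUDA_CHECK(cudaFree(d_frc));
        free(h_frc);
        free(h_sum);
        MPI_Finalize();
        return 0;
    }
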

On Sat, Feb 13, 2016 at 8:18 AM, Daniel Roe <daniel.r.roe.gmail.com> wrote:

> On Fri, Feb 12, 2016 at 5:58 PM, Sarah Anderson <saraha.cray.com> wrote:
> > Tried: 1) openmpi/1.8.4_gcc 2) cudatoolkit/7.5.18 3) gcc/4.9.1
>
> Another thing you may want to try (if you haven't already) is MPICH or
> MVAPICH. I haven't had a lot of success with OpenMPI and Amber over
> the years, but MPICH/MVAPICH usually works great for me.
>
> -Dan
>
> >
> > The single-GPU run of the configure -mpi -cuda -noX11 gnu build
> > worked fine, but two GPUs produced the same error.
> >> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
> >> srun: error: prd2-0171: task 0: Exited with exit code 255
> > I tried an MPI-only gnu build, which worked fine with 16 or 32 ranks.
> >
> > I dropped back to cudatoolkit/7.0.28
> > Same message. I'm out of ideas for now. When I have the output of
> >> lspci -d "10b5:*" -vvv | grep ACSCtl
> > I'll send it to you.
> >
> > Sarah
> >
> >
> > On 2/12/16 4:36 PM, Ross Walker wrote:
> >> Hi Sarah,
> >>
> >> I suspect this is indeed an MPI issue - likely related to it screwing
> up how P2P communication works - but it could also be a BIOS issue or a
> compiler issue - again related to peer-to-peer communication. I've seen
> race conditions in a few of the Intel compiler versions but have never
> been able to track them down, and concluded they were compiler bugs. Can
> you check a couple of things please:
> >>
> >> 1) Make sure all the latest updates are applied.
> >>
> >> 2) Build with the GNU compilers and see if the problem is still there.
> >>
> >> Also the output from the following command would be useful in
> determining if it is a BIOS bug on the hardware or not:
> >>
> >> lspci -d "10b5:*" -vvv | grep ACSCtl
> >>
> >> (But it unfortunately needs to be run as root).
> >>
> >> All the best
> >> Ross
> >>
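
A quick way to separate a bus/BIOS problem from an AMBER or MPI problem is a standalone peer-to-peer copy test along the lines of NVIDIA's simpleP2P sample (the 10b5 vendor ID in the lspci command above is the PLX PCIe switch hardware the K80's two GPUs sit behind). A hedged sketch, assuming the two halves of the K80 are devices 0 and 1; the buffer size is arbitrary:

    /* Rough standalone P2P check: copy a pattern GPU0 -> GPU1 over the peer
     * path and verify it on the host. Silent corruption here would point at
     * the bus/BIOS (e.g. ACS on the PLX bridges) rather than AMBER or MPI. */
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int can01 = 0, can10 = 0;
        cudaDeviceCanAccessPeer(&can01, 0, 1);
        cudaDeviceCanAccessPeer(&can10, 1, 0);
        printf("peer 0->1: %d, peer 1->0: %d\n", can01, can10);
        if (!can01 || !can10) return 1;

        const size_t n = 1 << 20;
        float *buf0, *buf1;

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);
        cudaMalloc(&buf0, n * sizeof(float));

        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
        cudaMalloc(&buf1, n * sizeof(float));

        /* Fill GPU0's buffer with a known pattern from the host. */
        float *host = (float *)malloc(n * sizeof(float));
        for (size_t i = 0; i < n; ++i) host[i] = (float)i;
        cudaSetDevice(0);
        cudaMemcpy(buf0, host, n * sizeof(float), cudaMemcpyHostToDevice);

        /* Direct device-to-device copy over the peer path. */
        cudaError_t err = cudaMemcpyPeer(buf1, 1, buf0, 0, n * sizeof(float));
        if (err != cudaSuccess) {
            fprintf(stderr, "peer copy failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        cudaDeviceSynchronize();

        /* Read GPU1's copy back and compare element by element. */
        float *check = (float *)calloc(n, sizeof(float));
        cudaSetDevice(1);
        cudaMemcpy(check, buf1, n * sizeof(float), cudaMemcpyDeviceToHost);
        size_t bad = 0;
        for (size_t i = 0; i < n; ++i) if (check[i] != host[i]) ++bad;
        printf("%zu mismatched elements\n", bad);
        return bad ? 1 : 0;
    }
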
> >>> On Feb 12, 2016, at 14:24, Sarah Anderson<saraha.cray.com> wrote:
> >>>
> >>> Hi Ross,
> >>>
> >>> I was running this test:
> >>>
> >>> Amber14_Benchmark_Suite/PME/JAC_production_NVE
> >>>
> >>> srun -n 2 --mpi=pmi2 --cpu_bind=socket $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin.GPU -o mdout.2GPU -p prmtop -c inpcrd
> >>>
> >>> Yes, a single GPU run completes normally.
> >>>
> >>> I tried another test in that suite, and though the 2GPU test
> completed, it came up with different (nonsense) answers.
> >>> Maybe there's something wrong with the MPI/build. It's just Intel impi
> & "configure -cuda -mpi -noX11 intel"
> >>>
> >>> Thanks,
> >>> Sarah
> >>>
> >>> Two GPUs
> >>>
> >>> R M S F L U C T U A T I O N S
> >>>
> >>>
> >>> NSTEP = 1000   TIME(PS) = 102.000   TEMP(K) = 6.13   PRESS = 0.0
> >>> Etot = 10445.2658   EKtot = 388.9169   EPtot = 10213.0555
> >>> BOND = 138.3458   ANGLE = 184.8487   DIHED = 119.1381
> >>> 1-4 NB = 35.9507   1-4 EEL = 842.2853   VDWAALS = 9138.2936
> >>> EELEC = 2970.4534   EGB = 3580.7687   RESTRAINT = 0.0000
> >>> ------------------------------------------------------------------------------
> >>>
> >>>
> >>>
> >>> Single GPU:
> >>>
> >>> R M S F L U C T U A T I O N S
> >>>
> >>>
> >>> NSTEP = 1000   TIME(PS) = 102.000   TEMP(K) = 0.91   PRESS = 0.0
> >>> Etot = 22.9288   EKtot = 57.4768   EPtot = 49.0089
> >>> BOND = 32.3643   ANGLE = 82.1033   DIHED = 80.2520
> >>> 1-4 NB = 14.6668   1-4 EEL = 107.9454   VDWAALS = 148.9948
> >>> EELEC = 280.7436   EGB = 212.3126   RESTRAINT = 0.0000
> >>> ------------------------------------------------------------------------------
> >>>
> >>>
> >>>
> >>>
> >>> On 2/12/16 1:16 PM, Ross Walker wrote:
> >>>> Hi Sarah,
> >>>>
> >>>> The error message doesn't tell us a lot here unfortunately. The issue
> with GPU error messages is that if you have an array in memory that
> contains a NaN - say the force array got an infinite force - then things
> will be fine 'until' the code tries to upload to or download from the GPU
> (or do a similar copy operation) - NaNs are not supported in these
> operations and thus you get the error that you see. The real error - e.g.
> 2 atoms sitting on top of each other - occurred somewhere else entirely
> within the code. The net result is that just because the error is in
> gpu_allreduce does not mean it is related to something wrong with the
> GPUs, a driver issue, or even a multi-GPU issue.
> >>>>
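
One hedged way to act on that explanation (a debugging sketch, not AMBER code): download the suspect array and scan it for NaN/Inf on the host, so the step that blew up is identified before the failure surfaces at some later copy or cudaDeviceSynchronize. The array name, size, and injected bad value below are invented for the demo.

    /* Debugging sketch: pull a force-style array back from the GPU and scan
     * it for NaN/Inf so a blow-up (e.g. two atoms sitting on top of each
     * other) is caught near the step where it happens, instead of showing up
     * later as a failed copy or synchronize. */
    #include <cuda_runtime.h>
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const long n = 3 * 1000;           /* 3*natom, made-up size */
        double *h_frc = (double *)malloc(n * sizeof(double));
        double *d_frc = NULL;

        for (long i = 0; i < n; ++i) h_frc[i] = 0.0;
        h_frc[123] = NAN;                  /* injected bad value for the demo */

        cudaMalloc(&d_frc, n * sizeof(double));
        cudaMemcpy(d_frc, h_frc, n * sizeof(double), cudaMemcpyHostToDevice);

        /* Any pending kernel errors from earlier launches surface here. */
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaDeviceSynchronize failed: %s\n",
                    cudaGetErrorString(err));
            return 1;
        }

        /* Download and scan: report the first non-finite element and which
         * "atom" it belongs to (index/3 for an x,y,z layout). */
        cudaMemcpy(h_frc, d_frc, n * sizeof(double), cudaMemcpyDeviceToHost);
        for (long i = 0; i < n; ++i) {
            if (!isfinite(h_frc[i])) {
                fprintf(stderr, "NaN/Inf at element %ld (atom %ld)\n", i, i / 3);
                return 1;
            }
        }
        printf("all %ld force components finite\n", n);
        return 0;
    }
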
> >>>> What it likely means is that something is wrong with the simulation
> itself that you are running. Have you tried running on just 1 GPU to see
> if that crashes as well (and also on the CPU)? Can you provide some more
> details about what you are actually simulating?
> >>>>
> >>>> All the best
> >>>> Ross
> >>>>
> >>>>> On Feb 12, 2016, at 10:42, Sarah Anderson<saraha.cray.com> wrote:
> >>>>>
> >>>>> Has anyone seen this message lately? I saw some notes about it in
> mid-2015 but no particular fix was suggested.
> >>>>>
> >>>>> This is with CUDA 7.5 using a pair of K80 GPUs in peer-to-peer mode.
> >>>>>
> >>>>> It fails with all combinations of CUDA_VISIBLE_DEVICES 0,1 2,3
> >>>>>
> >>>>>> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
> >>>>> |------------------- GPU DEVICE INFO --------------------
> >>>>> |
> >>>>> | Task ID: 0
> >>>>> | CUDA_VISIBLE_DEVICES: 0,1
> >>>>> | CUDA Capable Devices Detected: 2
> >>>>> | CUDA Device ID in use: 0
> >>>>> | CUDA Device Name: Tesla K80
> >>>>> | CUDA Device Global Mem Size: 11519 MB
> >>>>> | CUDA Device Num Multiprocessors: 13
> >>>>> | CUDA Device Core Freq: 0.82 GHz
> >>>>> |
> >>>>> |
> >>>>> | Task ID: 1
> >>>>> | CUDA_VISIBLE_DEVICES: 0,1
> >>>>> | CUDA Capable Devices Detected: 2
> >>>>> | CUDA Device ID in use: 1
> >>>>> | CUDA Device Name: Tesla K80
> >>>>> | CUDA Device Global Mem Size: 11519 MB
> >>>>> | CUDA Device Num Multiprocessors: 13
> >>>>> | CUDA Device Core Freq: 0.82 GHz
> >>>>> |
> >>>>> |--------------------------------------------------------
> >>>>>
> >>>>> |---------------- GPU PEER TO PEER INFO -----------------
> >>>>> |
> >>>>> | Peer to Peer support: ENABLED
> >>>>> |
> >>>>> |--------------------------------------------------------
> >>>>>
> >>>>>
> >>>>> Here is deviceQuery
> >>>>>
> >>>>> CUDA Device Query (Runtime API) version (CUDART static linking)
> >>>>>
> >>>>> Detected 2 CUDA Capable device(s)
> >>>>>
> >>>>> Device 0: "Tesla K80"
> >>>>> CUDA Driver Version / Runtime Version 7.5 / 7.5
> >>>>> CUDA Capability Major/Minor version number: 3.7
> >>>>> Total amount of global memory: 11520 MBytes (12079136768 bytes)
> >>>>> MapSMtoCores for SM 3.7 is undefined. Default to use 192 Cores/SM
> >>>>> MapSMtoCores for SM 3.7 is undefined. Default to use 192 Cores/SM
> >>>>> (13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
> >>>>> GPU Clock rate: 824 MHz (0.82 GHz)
> >>>>> Memory Clock rate: 2505 Mhz
> >>>>> Memory Bus Width: 384-bit
> >>>>> L2 Cache Size: 1572864 bytes
> >>>>> Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
> >>>>> Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
> >>>>> Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
> >>>>> Total amount of constant memory: 65536 bytes
> >>>>> Total amount of shared memory per block: 49152 bytes
> >>>>> Total number of registers available per block: 65536
> >>>>> Warp size: 32
> >>>>> Maximum number of threads per multiprocessor: 2048
> >>>>> Maximum number of threads per block: 1024
> >>>>> Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
> >>>>> Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
> >>>>> Maximum memory pitch: 2147483647 bytes
> >>>>> Texture alignment: 512 bytes
> >>>>> Concurrent copy and kernel execution: Yes with 2 copy engine(s)
> >>>>> Run time limit on kernels: No
> >>>>> Integrated GPU sharing Host Memory: No
> >>>>> Support host page-locked memory mapping: Yes
> >>>>> Alignment requirement for Surfaces: Yes
> >>>>> Device has ECC support: Enabled
> >>>>> Device supports Unified Addressing (UVA): Yes
> >>>>> Device PCI Bus ID / PCI location ID: 5 / 0
> >>>>> Compute Mode:
> >>>>> < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> >>>>>
> >>>>> Device 1: "Tesla K80"
> >>>>> CUDA Driver Version / Runtime Version 7.5 / 7.5
> >>>>> CUDA Capability Major/Minor version number: 3.7
> >>>>> Total amount of global memory: 11520 MBytes (12079136768 bytes)
> >>>>> MapSMtoCores for SM 3.7 is undefined. Default to use 192 Cores/SM
> >>>>> MapSMtoCores for SM 3.7 is undefined. Default to use 192 Cores/SM
> >>>>> (13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
> >>>>> GPU Clock rate: 824 MHz (0.82 GHz)
> >>>>> Memory Clock rate: 2505 Mhz
> >>>>> Memory Bus Width: 384-bit
> >>>>> L2 Cache Size: 1572864 bytes
> >>>>> Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
> >>>>> Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
> >>>>> Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
> >>>>> Total amount of constant memory: 65536 bytes
> >>>>> Total amount of shared memory per block: 49152 bytes
> >>>>> Total number of registers available per block: 65536
> >>>>> Warp size: 32
> >>>>> Maximum number of threads per multiprocessor: 2048
> >>>>> Maximum number of threads per block: 1024
> >>>>> Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
> >>>>> Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
> >>>>> Maximum memory pitch: 2147483647 bytes
> >>>>> Texture alignment: 512 bytes
> >>>>> Concurrent copy and kernel execution: Yes with 2 copy engine(s)
> >>>>> Run time limit on kernels: No
> >>>>> Integrated GPU sharing Host Memory: No
> >>>>> Support host page-locked memory mapping: Yes
> >>>>> Alignment requirement for Surfaces: Yes
> >>>>> Device has ECC support: Enabled
> >>>>> Device supports Unified Addressing (UVA): Yes
> >>>>> Device PCI Bus ID / PCI location ID: 6 / 0
> >>>>> Compute Mode:
> >>>>> < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> >>>>>> Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : Yes
> >>>>>> Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : Yes
> >>>>> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla K80, Device1 = Tesla K80
> >>>>> Result = PASS
> >>>>>
> >>>>>
> >>>>>
>
>
>
> --
> -------------------------
> Daniel R. Roe, PhD
> Department of Medicinal Chemistry
> University of Utah
> 30 South 2000 East, Room 307
> Salt Lake City, UT 84112-5820
> http://home.chpc.utah.edu/~cheatham/
> (801) 587-9652
> (801) 585-6208 (Fax)
>
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Sat Feb 13 2016 - 09:00:03 PST