Hi Ross,
I was running this test:
Amber14_Benchmark_Suite/PME/JAC_production_NVE
srun  -n 2 --mpi=pmi2  --cpu_bind=socket $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin.GPU  -o mdout.2GPU -p prmtop -c inpcrd
Yes, a single GPU run completes normally.
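For comparison, the single-GPU run was along these lines (a rough sketch - the serial pmemd.cuda binary and the mdout.1GPU name are my guesses at the exact invocation, not copied from the job script):

  # single-GPU comparison run with the serial GPU binary (names approximate)
  $AMBERHOME/bin/pmemd.cuda -O -i mdin.GPU -o mdout.1GPU -p prmtop -c inpcrd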
I tried another test in that suite as well; the 2-GPU run completed, but it came up with different (nonsense) answers - the RMS fluctuation summaries are below.
Maybe there's something wrong with the MPI or the build. It's just Intel MPI (impi) and "configure -cuda -mpi -noX11 intel".
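If the build is the culprit, one check would be a clean rebuild against a different MPI stack - a rough sketch, assuming configure picks up whichever mpicc/mpif90 wrappers are first on PATH:

  cd $AMBERHOME
  make clean
  # put the alternative MPI's compiler wrappers first on PATH, then:
  ./configure -cuda -mpi -noX11 intel
  make install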
Thanks,
Sarah
Two GPUs:
       R M S  F L U C T U A T I O N S
  NSTEP =     1000   TIME(PS) =     102.000  TEMP(K) = 6.13  PRESS =     0.0
  Etot   =     10445.2658  EKtot   =       388.9169 EPtot      =     10213.0555
  BOND   =       138.3458  ANGLE   =       184.8487 DIHED      =       119.1381
  1-4 NB =        35.9507  1-4 EEL =       842.2853 VDWAALS    =      9138.2936
  EELEC  =      2970.4534  EGB     =      3580.7687 RESTRAINT  =         0.0000
  ------------------------------------------------------------------------------
Single GPU:
       R M S  F L U C T U A T I O N S
  NSTEP =     1000   TIME(PS) =     102.000  TEMP(K) = 0.91  PRESS =     0.0
  Etot   =        22.9288  EKtot   =        57.4768 EPtot      =        49.0089
  BOND   =        32.3643  ANGLE   =        82.1033 DIHED      =        80.2520
  1-4 NB =        14.6668  1-4 EEL =       107.9454 VDWAALS    =       148.9948
  EELEC  =       280.7436  EGB     =       212.3126 RESTRAINT  =         0.0000
  ------------------------------------------------------------------------------
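A quick way to see how far the two runs drift apart over the whole trajectory (again assuming mdout.1GPU for the single-GPU output):

  # pull the total-energy lines from both outputs for a side-by-side look
  grep "Etot" mdout.1GPU mdout.2GPU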
On 2/12/16 1:16 PM, Ross Walker wrote:
> Hi Sarah,
>
> The error message unfortunately doesn't tell us much here. The trouble with GPU error messages is that if an array in memory contains a NaN - say the force array picked up an infinite force - then things appear fine until the code tries to upload or download from the GPU (or do a similar copy operation). NaNs are not supported in those operations, and that is where you see the error. The real error - e.g. two atoms sitting on top of each other - occurred somewhere else entirely within the code. The net result is that just because the failure is reported in gpu_allreduce does not mean it is related to something wrong with the GPUs, a driver issue, or even a multi-GPU issue.
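One thing I can try, to localize where the bad value or access actually originates rather than where the copy fails, is rerunning the failing case under cuda-memcheck from the CUDA 7.5 toolkit - roughly:

  # wrap each MPI rank's binary with cuda-memcheck; the output file
  # name here is just a placeholder
  srun -n 2 --mpi=pmi2 --cpu_bind=socket \
      cuda-memcheck $AMBERHOME/bin/pmemd.cuda.MPI \
      -O -i mdin.GPU -o mdout.memcheck -p prmtop -c inpcrd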
>
> What it likely means is that there is something wrong with the simulation itself. Have you tried running on just 1 GPU to see if that crashes as well (and also on the CPU)? Can you provide some more details about what you are actually simulating?
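For the CPU check, a short run of the same inputs with the CPU binary should be enough to tell - a sketch (the rank count and output name are arbitrary):

  # CPU-only comparison with pmemd.MPI on the same prmtop/inpcrd
  srun -n 16 --mpi=pmi2 $AMBERHOME/bin/pmemd.MPI \
      -O -i mdin.GPU -o mdout.cpu -p prmtop -c inpcrd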
>
> All the best
> Ross
>
>> On Feb 12, 2016, at 10:42, Sarah Anderson <saraha.cray.com> wrote:
>>
>> Has anyone seen this message lately? I saw some notes about it from mid-2015, but no particular fix was suggested.
>>
>> This is with CUDA 7.5, using a pair of K80 GPUs in peer-to-peer mode.
>>
>> It fails with both combinations of CUDA_VISIBLE_DEVICES (0,1 and 2,3):
>>
>>> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
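For the record, selecting each device pair per run looks roughly like this (the real job script differs, but the idea is the same):

  # expose one pair of GPUs to the job at a time
  export CUDA_VISIBLE_DEVICES=0,1    # and 2,3 for the other pair
  srun -n 2 --mpi=pmi2 $AMBERHOME/bin/pmemd.cuda.MPI \
      -O -i mdin.GPU -o mdout.2GPU -p prmtop -c inpcrd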
>>
>> |------------------- GPU DEVICE INFO --------------------
>> |
>> |                         Task ID:      0
>> |            CUDA_VISIBLE_DEVICES: 0,1
>> |   CUDA Capable Devices Detected:      2
>> |           CUDA Device ID in use:      0
>> |                CUDA Device Name: Tesla K80
>> |     CUDA Device Global Mem Size:  11519 MB
>> | CUDA Device Num Multiprocessors:     13
>> |           CUDA Device Core Freq:   0.82 GHz
>> |
>> |
>> |                         Task ID:      1
>> |            CUDA_VISIBLE_DEVICES: 0,1
>> |   CUDA Capable Devices Detected:      2
>> |           CUDA Device ID in use:      1
>> |                CUDA Device Name: Tesla K80
>> |     CUDA Device Global Mem Size:  11519 MB
>> | CUDA Device Num Multiprocessors:     13
>> |           CUDA Device Core Freq:   0.82 GHz
>> |
>> |--------------------------------------------------------
>>
>> |---------------- GPU PEER TO PEER INFO -----------------
>> |
>> |   Peer to Peer support: ENABLED
>> |
>> |--------------------------------------------------------
>>
>>
>> Here is deviceQuery
>>
>>   CUDA Device Query (Runtime API) version (CUDART static linking)
>>
>> Detected 2 CUDA Capable device(s)
>>
>> Device 0: "Tesla K80"
>>    CUDA Driver Version / Runtime Version          7.5 / 7.5
>>    CUDA Capability Major/Minor version number:    3.7
>>    Total amount of global memory:                 11520 MBytes (12079136768 bytes)
>> MapSMtoCores for SM 3.7 is undefined.  Default to use 192 Cores/SM
>> MapSMtoCores for SM 3.7 is undefined.  Default to use 192 Cores/SM
>>    (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
>>    GPU Clock rate:                                824 MHz (0.82 GHz)
>>    Memory Clock rate:                             2505 Mhz
>>    Memory Bus Width:                              384-bit
>>    L2 Cache Size:                                 1572864 bytes
>>    Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
>>    Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
>>    Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
>>    Total amount of constant memory:               65536 bytes
>>    Total amount of shared memory per block:       49152 bytes
>>    Total number of registers available per block: 65536
>>    Warp size:                                     32
>>    Maximum number of threads per multiprocessor:  2048
>>    Maximum number of threads per block:           1024
>>    Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
>>    Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
>>    Maximum memory pitch:                          2147483647 bytes
>>    Texture alignment:                             512 bytes
>>    Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
>>    Run time limit on kernels:                     No
>>    Integrated GPU sharing Host Memory:            No
>>    Support host page-locked memory mapping:       Yes
>>    Alignment requirement for Surfaces:            Yes
>>    Device has ECC support:                        Enabled
>>    Device supports Unified Addressing (UVA):      Yes
>>    Device PCI Bus ID / PCI location ID:           5 / 0
>>    Compute Mode:
>>       < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
>>
>> Device 1: "Tesla K80"
>>    CUDA Driver Version / Runtime Version          7.5 / 7.5
>>    CUDA Capability Major/Minor version number:    3.7
>>    Total amount of global memory:                 11520 MBytes (12079136768 bytes)
>> MapSMtoCores for SM 3.7 is undefined.  Default to use 192 Cores/SM
>> MapSMtoCores for SM 3.7 is undefined.  Default to use 192 Cores/SM
>>    (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
>>    GPU Clock rate:                                824 MHz (0.82 GHz)
>>    Memory Clock rate:                             2505 Mhz
>>    Memory Bus Width:                              384-bit
>>    L2 Cache Size:                                 1572864 bytes
>>    Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
>>    Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
>>    Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
>>    Total amount of constant memory:               65536 bytes
>>    Total amount of shared memory per block:       49152 bytes
>>    Total number of registers available per block: 65536
>>    Warp size:                                     32
>>    Maximum number of threads per multiprocessor:  2048
>>    Maximum number of threads per block:           1024
>>    Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
>>    Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
>>    Maximum memory pitch:                          2147483647 bytes
>>    Texture alignment:                             512 bytes
>>    Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
>>    Run time limit on kernels:                     No
>>    Integrated GPU sharing Host Memory:            No
>>    Support host page-locked memory mapping:       Yes
>>    Alignment requirement for Surfaces:            Yes
>>    Device has ECC support:                        Enabled
>>    Device supports Unified Addressing (UVA):      Yes
>>    Device PCI Bus ID / PCI location ID:           6 / 0
>>    Compute Mode:
>>       < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
>>> Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : Yes
>>> Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : Yes
>> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla K80, Device1 = Tesla K80
>> Result = PASS
>>
>>
>>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Feb 12 2016 - 14:30:03 PST