[AMBER] Installation & testing of AMBER18 parallel GPU problem

From: Ibrahim M. Moustafa <ria2.psu.edu>
Date: Fri, 30 Nov 2018 12:38:11 -0500

Dear all,
   We are trying to install AMBER18 on our server running Debian 8 and CUDA
8.0.61, following the instructions in the manual. The compilation went through,
but the parallel GPU build failed its tests. We then checked the
installation with the Amber18 benchmarks (Ross Walker & Dave Cerutti) on a
workstation with four 1080 Ti GPUs. The outputs from the CPU and individual GPU
runs look normal, except that the following message appeared in each run:
Note: The following floating-point exceptions are signalling:
IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
  The timings for those runs are comparable to those reported for the
benchmarks.
  However, for the parallel GPU runs, the jobs were terminated with the
following error: "gpu_allreduce cudaDeviceSynchronize failed an illegal memory
access was encountered"
  Checking the outputs of the multi-GPU runs showed NaN for TEMP, Etot and
EKtot (which was not the case when running on one GPU or on the CPU).
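  For anyone wanting to repeat this check, below is a minimal Python sketch
that scans mdout text for NaN in those three fields. The function name
find_nan_fields and the sample lines are made up for illustration (they are not
copied from our runs), and the regex assumes the usual "NAME = value" layout of
the energy records.

```python
import re

# Fields to watch; these are the three that showed NaN in the multi-GPU runs.
FIELDS = ("TEMP(K)", "Etot", "EKtot")

# Matches "NAME = value" pairs such as "Etot = -58216.4" or "TEMP(K) = NaN".
PAIR = re.compile(r"([A-Za-z0-9_()\-]+)\s*=\s*(\S+)")


def find_nan_fields(text, fields=FIELDS):
    """Return (line_number, field) for every watched field whose value is NaN."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, value in PAIR.findall(line):
            if name in fields and value.lower() == "nan":
                hits.append((lineno, name))
    return hits


if __name__ == "__main__":
    # Hypothetical two-line excerpt in mdout style, not taken from our output.
    sample = (
        " NSTEP =     1000   TIME(PS) =       4.000  TEMP(K) =      NaN  PRESS =     0.0\n"
        " Etot   =          NaN  EKtot   =          NaN  EPtot      =    -72000.1234\n"
    )
    for lineno, name in find_nan_fields(sample):
        print(f"line {lineno}: {name} is NaN")
```

An empty result for the single-GPU and CPU outputs, together with hits for the
multi-GPU outputs, would match what we saw by eye.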
  So, any idea what could be wrong with the parallel GPU compilation? Is this
something related to AMBER18, or are we missing something?
Last year I had AMBER16 installed on a similar workstation with 4x GPU 1080Ti
cards and did not encounter any of these problems!
  Your comments and feedback to help resolve this problem would be much
appreciated.
  Thanks, Ibrahim
   Below is part of the output from running the RW benchmark:
------------------------------------------------------------------------------
Amber18_Benchmark_Suite_RW
The output of run_bench_CPU+GPU.sh script:

JAC_PRODUCTION_NVE - 23,558 atoms PME 4fs
-----------------------------------------
CPU code 16 cores: | ns/day = 42.79 seconds/ns = 2019.18
[0] 1 x GPU: Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
| ns/day = 728.43 seconds/ns = 118.61
[1] 1 x GPU: Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
| ns/day = 745.12 seconds/ns = 115.95
[2] 1 x GPU: Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
| ns/day = 754.48 seconds/ns = 114.52
[3] 1 x GPU: Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
| ns/day = 675.99 seconds/ns = 127.81
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
Multiple Single GPU Run Performance
[0] 1 x GPU: | ns/day = 724.71 seconds/ns = 119.22
[1] 1 x GPU: | ns/day = 740.73 seconds/ns = 116.64
[2] 1 x GPU: | ns/day = 748.96 seconds/ns = 115.36
[3] 1 x GPU: | ns/day = 696.69 seconds/ns = 124.02
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[2842,1],0]
  Exit code: 255
--------------------------------------------------------------------------
P2P 2 x GPU: grep: mdinfo.2GPU: No such file or directory
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[3263,1],1]
  Exit code: 255
--------------------------------------------------------------------------
P2P 4 x GPU: grep: mdinfo.4GPU: No such file or directory
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[3112,1],0]
  Exit code: 255
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[3115,1],0]
  Exit code: 255
--------------------------------------------------------------------------
Multiple 2xGPU Run Performance
[0,1] 2 x GPU: grep: mdinfo.2GPU_0: No such file or directory
[2,3] 2 x GPU: grep: mdinfo.2GPU_2: No such file or directory

JAC_PRODUCTION_NPT - 23,558 atoms PME 4fs
-----------------------------------------
CPU code 16 cores: | ns/day = 51.27 seconds/ns = 1685.35
[0] 1 x GPU: Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
| ns/day = 687.73 seconds/ns = 125.63
[1] 1 x GPU: Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
| ns/day = 703.93 seconds/ns = 122.74
[2] 1 x GPU: Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
| ns/day = 712.99 seconds/ns = 121.18
[3] 1 x GPU: Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
| ns/day = 698.04 seconds/ns = 123.78
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
Multiple Single GPU Run Performance
[0] 1 x GPU: | ns/day = 688.73 seconds/ns = 125.45
[1] 1 x GPU: | ns/day = 704.92 seconds/ns = 122.57
[2] 1 x GPU: | ns/day = 707.74 seconds/ns = 122.08
[3] 1 x GPU: | ns/day = 662.15 seconds/ns = 130.48

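The timing pairs in the output above can be pulled out and sanity-checked with
a short script. This is only a sketch: parse_rates and consistent are
hypothetical helper names, not part of the benchmark suite, and the check
relies only on the fact that seconds/ns should equal 86400 divided by ns/day,
within rounding of the printed values.

```python
import re

SECONDS_PER_DAY = 86400.0

# Matches pairs like "ns/day = 728.43 seconds/ns = 118.61".
RATE = re.compile(
    r"ns/day\s*=\s*(\d+(?:\.\d+)?)\s+seconds/ns\s*=\s*(\d+(?:\.\d+)?)"
)


def parse_rates(text):
    """Return a list of (ns_per_day, seconds_per_ns) tuples found in text."""
    return [(float(a), float(b)) for a, b in RATE.findall(text)]


def consistent(ns_per_day, seconds_per_ns, rel_tol=1e-3):
    """True if seconds/ns agrees with 86400 / (ns/day) within rel_tol."""
    expected = SECONDS_PER_DAY / ns_per_day
    return abs(expected - seconds_per_ns) <= rel_tol * seconds_per_ns


if __name__ == "__main__":
    # Two lines copied from the JAC_PRODUCTION_NVE section above.
    text = (
        "CPU code 16 cores: | ns/day = 42.79 seconds/ns = 2019.18\n"
        "| ns/day = 728.43 seconds/ns = 118.61\n"
    )
    for ns_day, sec_ns in parse_rates(text):
        print(ns_day, sec_ns, consistent(ns_day, sec_ns))
```

The relative tolerance is there because both numbers are printed with only two
decimals, so a slow CPU line can differ from 86400 / (ns/day) by a few
hundredths of a second.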
Ibrahim M. Moustafa, Ph.D.
Pennsylvania State University
Biochemistry & Molecular Biology Dept.
Millennium Science Complex
University Park, PA 16802

Tel (814)863 5940
Fax (814)865 7927


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Nov 30 2018 - 10:00:01 PST