[AMBER] multinode pmemd.cuda.MPI jac9999 behavior

From: Scott Brozell <sbrozell.rci.rutgers.edu>
Date: Tue, 21 Mar 2017 15:52:34 -0400

Hi,

On a cluster where 20 nodes each have one NVIDIA Tesla K40
https://www.osc.edu/resources/technical_support/supercomputers/ruby/technical_specifications
repeated runs of a 2-node JAC9999 benchmark show this behavior:
the first few jobs (1, 2, or 3) on a specific node pair succeed,
and most subsequent jobs run on that pair shortly afterward fail.

Jobs usually stop after the printout for nstep 3000 to 6000. The errors
involve illegal memory accesses, e.g.:
cudaMemcpy GpuBuffer::Download failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
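
For anyone triaging a batch of such runs, the following is a minimal
sketch (not the attached script) that reports the last step printed and
any illegal memory access lines in one job's output. It assumes that the
energy printouts contain the text "NSTEP = <number>" and that the error
text appears in the same file or has been concatenated with it; the name
summarize_run is hypothetical.

```python
import re

# Assumption: energy printouts contain "NSTEP = <integer>".
NSTEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")
# Substring common to both error messages quoted above.
ERROR_MARKER = "illegal memory access"


def summarize_run(text):
    """Return (last_nstep, error_lines) for one job's output text.

    last_nstep is None when no NSTEP printout is found.
    """
    steps = [int(m.group(1)) for m in NSTEP_RE.finditer(text)]
    errors = [line.strip() for line in text.splitlines()
              if ERROR_MARKER in line]
    return (steps[-1] if steps else None, errors)
```

Calling summarize_run on the text of each job's output, and tabulating
the result by node pair and start time, would show whether failures
cluster on particular pairs.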

Initial sysadmin testing of the hardware has not found any issues.
Repeated single-node pmemd.cuda jac9999 benchmarks and other jobs
show no problems.

Has anyone else seen behavior like this?

Script and two example outputs are attached.

thanks,
scott

===
./update_amber --show-applied-patches
AmberTools 16 Applied Patches:
------------------------------
update.1, update.2, update.3, update.4, update.5, update.6, update.7, update.8, update.9, update.10,
update.11, update.12, update.13, update.14, update.15, update.16, update.17, update.18, update.19, update.20,
update.21

Amber 16 Applied Patches:
-------------------------
update.1 (modifies pmemd, pmemd.cuda, pmemd.cuda.MPI)
update.2 (modifies pmemd.cuda.MPI)
update.3 (modifies pmemd)
update.4 (modifies pmemd)
update.5 (modifies pmemd.cuda)
update.6 (modifies pmemd.cuda)
update.7 (modifies pmemd.cuda)
===
===
 short md, nve ensemble
 &cntrl
   ntx=7, irest=1,
   ntc=2, ntf=2, tol=0.0000001,
   nstlim=9999,
   ntpr=1000, ntwr=10000,
   dt=0.001,
   cut=9.,
   ntt=0, temp0=300.,
 &end
 &ewald
  nfft1=64,nfft2=64,nfft3=64,
  skinnb=2.,
 &end
===
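
As a rough aid for reading the input above, this sketch converts the
step counts into simulated time. It only uses nstlim, ntpr, and dt from
the input (dt is in picoseconds); the variable names and the
3000 to 6000 window are taken from the description of the failures, and
nothing here is measured data.

```python
# Values copied from the &cntrl namelist above.
nstlim = 9999    # total MD steps
ntpr = 1000      # energy printout interval, in steps
dt_ps = 0.001    # time step in picoseconds

# Total simulated time for a complete run.
total_ps = nstlim * dt_ps

# Steps that are multiples of ntpr within the run.
printout_steps = list(range(ntpr, nstlim + 1, ntpr))

# Simulated time at the printouts after which jobs usually stop.
failure_window_ps = (3000 * dt_ps, 6000 * dt_ps)

print(f"total simulated time: {total_ps:.3f} ps")
print(f"printouts at multiples of ntpr: {len(printout_steps)}")
print(f"failure window: {failure_window_ps[0]:.1f} to "
      f"{failure_window_ps[1]:.1f} ps")
```

So the failing jobs get through roughly a third to two thirds of the
run before stopping.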


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber

Received on Tue Mar 21 2017 - 13:00:02 PDT