[AMBER] multinode pmemd.cuda.MPI jac9999 behavior

From: Scott Brozell <sbrozell.rci.rutgers.edu>
Date: Tue, 21 Mar 2017 15:52:34 -0400


On a cluster where 20 nodes each have one NVIDIA Tesla K40,
repeated runs of a 2-node JAC9999 benchmark show this behavior:
the first few (1, 2, or 3) jobs on a specific node pair work,
and most subsequent jobs run soon afterward on that pair fail.

Jobs usually stop after the 3000 to 6000 nstep printout. The errors
involve illegal memory access, e.g.:
cudaMemcpy GpuBuffer::Download failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
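
For comparing many repeated runs, a small helper along these lines could tabulate how far each job got before failing. This is only a sketch, not part of the attached script: the function name `summarize_run` is hypothetical, and it assumes the captured output for each job contains both the usual `NSTEP = ...` energy printouts and the CUDA error text shown above.

```python
import re
import sys

# Matches the step counter in mdout energy records, e.g. " NSTEP =     3000".
NSTEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")
# Substring shared by both error messages quoted above.
ERROR_MARKER = "illegal memory access"


def summarize_run(text):
    """Return (last NSTEP printed or None, list of error lines) for one job."""
    last_nstep = None
    errors = []
    for line in text.splitlines():
        match = NSTEP_RE.search(line)
        if match:
            last_nstep = int(match.group(1))
        if ERROR_MARKER in line:
            errors.append(line.strip())
    return last_nstep, errors


if __name__ == "__main__":
    # Usage (hypothetical file names): python summarize_runs.py job1.out job2.out
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            nstep, errs = summarize_run(handle.read())
        status = "FAILED" if errs else "ok"
        print(f"{path}: last NSTEP = {nstep}, {status}")
        for err in errs:
            print(f"    {err}")
```

Running it over the outputs from one node pair, in submission order, would show directly whether the failures cluster in the 3000 to 6000 step range.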

Initial sysadmin testing of the hardware has not found any issues.
Repeated single-node pmemd.cuda JAC9999 benchmarks and other jobs
show no problems.

Has anyone else seen behavior like this?

Script and two example outputs are attached.


./update_amber --show-applied-patches
AmberTools 16 Applied Patches:
update.1, update.2, update.3, update.4, update.5, update.6, update.7, update.8, update.9, update.10,
update.11, update.12, update.13, update.14, update.15, update.16, update.17, update.18, update.19, update.20,

Amber 16 Applied Patches:
update.1 (modifies pmemd, pmemd.cuda, pmemd.cuda.MPI)
update.2 (modifies pmemd.cuda.MPI)
update.3 (modifies pmemd)
update.4 (modifies pmemd)
update.5 (modifies pmemd.cuda)
update.6 (modifies pmemd.cuda)
update.7 (modifies pmemd.cuda)
 short md, nve ensemble
   ntx=7, irest=1,
   ntc=2, ntf=2, tol=0.0000001,
   ntpr=1000, ntwr=10000,
   ntt=0, temp0=300.,

AMBER mailing list

Received on Tue Mar 21 2017 - 13:00:02 PDT