Re: [AMBER] multinode pmemd.cuda.MPI jac9999 behavior

From: Niel Henriksen <shireham.gmail.com>
Date: Tue, 21 Mar 2017 13:50:00 -0700

This is a shot in the dark (lots of variables with cluster
hardware/software) ....

It looks like you're using mvapich2. I had problems running pmemd.cuda.MPI
jobs unless I set the following environment variable before launching:

export MV2_ENABLE_AFFINITY=0
mpiexec.hydra -f $PBS_NODEFILE -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O ...
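
A quick way to check that the setting is really exported from the job
script (rather than just set in the current shell) is sketched below. It
assumes a bash job script; the helper name child_affinity is made up for
illustration and is not part of Amber or mvapich2.

```shell
#!/bin/bash
# Sketch only: child_affinity is a hypothetical helper, not an Amber/mvapich2 tool.
export MV2_ENABLE_AFFINITY=0

# Report the value that a child process started from this script inherits.
child_affinity() {
    bash -c 'echo "MV2_ENABLE_AFFINITY=${MV2_ENABLE_AFFINITY:-unset}"'
}

child_affinity   # prints MV2_ENABLE_AFFINITY=0
```

This only confirms the environment of the job script on the first node;
whether the ranks on the second node see the same value depends on how the
launcher forwards the environment.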

Best,
--Niel


On Tue, Mar 21, 2017 at 12:52 PM, Scott Brozell <sbrozell.rci.rutgers.edu>
wrote:

> Hi,
>
> On a cluster where 20 nodes have 1 NVIDIA Tesla K40
> https://www.osc.edu/resources/technical_support/supercomputers/ruby/technical_specifications
> repeated runs of a 2 node JAC9999 benchmark show this behavior:
> the first couple (1, 2, or 3) of jobs on a specific node pair work
> and most subsequent (but in temporal proximity) jobs on that pair fail.
>
> Jobs usually stop after the 3000 to 6000 nstep printout. The errors
> involve illegal memory access, e.g.:
> cudaMemcpy GpuBuffer::Download failed an illegal memory access was encountered
> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
>
> Initial sys admin testing of the hardware doesn't find any issues.
> Repeated single node pmemd.cuda jac9999 benchmarks and other jobs
> show no problems.
>
> Has anyone else seen behavior like this?
>
> Script and two example outputs are attached.
>
> thanks,
> scott
>
> ===
> ./update_amber --show-applied-patches
> AmberTools 16 Applied Patches:
> ------------------------------
> update.1, update.2, update.3, update.4, update.5, update.6, update.7,
> update.8, update.9, update.10,
> update.11, update.12, update.13, update.14, update.15, update.16,
> update.17, update.18, update.19, update.20,
> update.21
>
> Amber 16 Applied Patches:
> -------------------------
> update.1 (modifies pmemd, pmemd.cuda, pmemd.cuda.MPI)
> update.2 (modifies pmemd.cuda.MPI)
> update.3 (modifies pmemd)
> update.4 (modifies pmemd)
> update.5 (modifies pmemd.cuda)
> update.6 (modifies pmemd.cuda)
> update.7 (modifies pmemd.cuda)
> ===
> ===
> short md, nve ensemble
> &cntrl
> ntx=7, irest=1,
> ntc=2, ntf=2, tol=0.0000001,
> nstlim=9999,
> ntpr=1000, ntwr=10000,
> dt=0.001,
> cut=9.,
> ntt=0, temp0=300.,
> &end
> &ewald
> nfft1=64,nfft2=64,nfft3=64,
> skinnb=2.,
> &end
> ===
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>
>
Received on Tue Mar 21 2017 - 14:00:02 PDT