Re: [AMBER] multinode pmemd.cuda.MPI jac9999 behavior

From: Scott Brozell <sbrozell.rci.rutgers.edu>
Date: Thu, 23 Mar 2017 18:58:52 -0400

Hi,

This variable and/or related ones were already being set, and explicit
testing shows no impact from MV2_ENABLE_AFFINITY.
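
For reference, a minimal sketch (my own illustration, not the actual job
script) of how the setting can be confirmed on both hosts before launching
pmemd.cuda.MPI:

export MV2_ENABLE_AFFINITY=0
# check that the variable is propagated to each host in the allocation
mpiexec.hydra -f $PBS_NODEFILE -np 2 \
  sh -c 'echo "$(hostname): MV2_ENABLE_AFFINITY=$MV2_ENABLE_AFFINITY"'
mpiexec.hydra -f $PBS_NODEFILE -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O ...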

The Amber build used intel/16.0.3 and mvapich2/2.2.
I have attached the outputs of the GPU_Validation_Test.
There are 4 distinct energies for the small test, but this is considered
passing for the Intel compilers, correct?

cut -c 17-31 GPU_0.log | rmwhitespace | sort | uniq -c
r0214gpu
      2 -58242.4056
      6 -58225.0889
      9 -58223.9416
      3 -58215.9390
r0218gpu
      4 -58242.4056
      7 -58225.0889
      3 -58223.9416
      6 -58215.9390
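
(Note: rmwhitespace is a local helper; assuming it just strips blanks, an
equivalent using only standard tools is

  cut -c 17-31 GPU_0.log | tr -d ' \t' | sort | uniq -c

i.e., columns 17-31 hold the reported energy, and uniq -c counts how many
runs reproduced each distinct value.)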

thanks,
scott

On Tue, Mar 21, 2017 at 01:50:00PM -0700, Niel Henriksen wrote:
> This is a shot in the dark (lots of variables with cluster
> hardware/software) ....
>
> It looks like you're using mvapich2. I had problems running pmemd.cuda.MPI
> jobs without setting the following environment variable:
>
> export MV2_ENABLE_AFFINITY=0
> mpiexec.hydra -f $PBS_NODEFILE -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O ...
>
> On Tue, Mar 21, 2017 at 12:52 PM, Scott Brozell <sbrozell.rci.rutgers.edu>
> wrote:
> > On a cluster where 20 nodes have 1 NVIDIA Tesla K40 each
> > https://www.osc.edu/resources/technical_support/supercomputers/ruby/technical_specifications
> > repeated runs of a 2-node JAC9999 benchmark show this behavior:
> > the first few (1, 2, or 3) jobs on a specific node pair work,
> > and most subsequent jobs (run in temporal proximity) on that pair fail.
> >
> > Jobs usually stop after the 3000 to 6000 nstep printout. The errors
> > involve illegal memory access, e.g.:
> > cudaMemcpy GpuBuffer::Download failed an illegal memory access was
> > encountered
> > gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was
> > encountered
> >
> > Initial sys admin testing of the hardware doesn't find any issues.
> > Repeated single node pmemd.cuda jac9999 benchmarks and other jobs
> > show no problems.
> >
> > Has anyone else seen behavior like this?
> >
> > Script and two example outputs are attached.
> >
> > thanks,
> > scott
> >
> > ===
> > ./update_amber --show-applied-patches
> > AmberTools 16 Applied Patches:
> > ------------------------------
> > update.1, update.2, update.3, update.4, update.5, update.6, update.7,
> > update.8, update.9, update.10,
> > update.11, update.12, update.13, update.14, update.15, update.16,
> > update.17, update.18, update.19, update.20,
> > update.21
> >
> > Amber 16 Applied Patches:
> > -------------------------
> > update.1 (modifies pmemd, pmemd.cuda, pmemd.cuda.MPI)
> > update.2 (modifies pmemd.cuda.MPI)
> > update.3 (modifies pmemd)
> > update.4 (modifies pmemd)
> > update.5 (modifies pmemd.cuda)
> > update.6 (modifies pmemd.cuda)
> > update.7 (modifies pmemd.cuda)
> > ===
> > ===
> > short md, nve ensemble
> > &cntrl
> > ntx=7, irest=1,
> > ntc=2, ntf=2, tol=0.0000001,
> > nstlim=9999,
> > ntpr=1000, ntwr=10000,
> > dt=0.001,
> > cut=9.,
> > ntt=0, temp0=300.,
> > &end
> > &ewald
> > nfft1=64,nfft2=64,nfft3=64,
> > skinnb=2.,
> > &end
> > ===


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber

Received on Thu Mar 23 2017 - 16:00:05 PDT