Re: [AMBER] multinode pmemd.cuda.MPI jac9999 behavior from Scott Brozell on 2017-04-24 (Amber Archive Apr 2017)

From: Scott Brozell <sbrozell.rci.rutgers.edu>
Date: Mon, 24 Apr 2017 16:26:21 -0400

Hi,

1. GPU_Validation_Test

The differing energies are due to the Intel compilers.
Here are results for the whole cluster (small test first, then large test):

grep -h Etot `f. GPU_*g` | cut -d':' -f2 | sort | uniq -c
    105 Etot = -58215.9390 EKtot = 14443.1777 EPtot = -72659.1168
     81 Etot = -58223.9416 EKtot = 14337.9287 EPtot = -72561.8704
    104 Etot = -58225.0889 EKtot = 14418.4785 EPtot = -72643.5674
    110 Etot = -58242.4056 EKtot = 14469.3867 EPtot = -72711.7923

    196 Etot = -2709235.2967 EKtot = 662445.1250 EPtot = -3371680.4217
    204 Etot = -2710562.0191 EKtot = 661308.3125 EPtot = -3371870.3316

versus GNU 6.3.0:

    400 Etot = -58221.5446 EKtot = 14382.0596 EPtot = -72603.6041

    400 Etot = -2710123.9790 EKtot = 662223.0000 EPtot = -3372346.9790

What is the explanation for this?
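
For anyone who wants to repeat this tally on their own logs, here is a
minimal sketch. It assumes each energy line contains the whitespace-separated
form "Etot = <value>", as in the counts above. The function name
distinct_etot is only illustrative; it is not part of Amber or of the
GPU_Validation_Test.

```shell
# Count how often each distinct Etot value occurs across a set of logs.
# Usage: distinct_etot FILE...
# Assumption: energy lines look like "... Etot = -58215.9390 EKtot = ...".
distinct_etot() {
    # -h drops file names so identical energies from different logs collapse.
    grep -h 'Etot' "$@" \
        | awk '{
              # Print the token that follows "Etot =" on each matching line.
              for (i = 1; i < NF - 1; i++)
                  if ($i == "Etot" && $(i + 1) == "=") print $(i + 2)
          }' \
        | sort \
        | uniq -c
}
```

One output line means every run gave the same Etot; more than one line
means the runs disagree. A call such as "distinct_etot GPU_*.log" in the
test directory would be the typical use.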

2. Sporadic failures of pmemd.cuda.MPI in repeated runs of a 2-node JAC9999 benchmark.

The only news here is that it also happens with a pmemd.cuda.MPI built with GNU 6.3.0.

Is this the only report of such issues?
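
To make the failure pattern easier to reproduce and report, a sketch of a
repeat-run wrapper is given here. It is an illustration, not the script
attached to the original post: the name repeat_run, the log file layout,
and the choice to flag a run as failed on either a nonzero exit status or
the string "illegal memory access" in its output are all assumptions.

```shell
# Run a command COUNT times, logging each run and reporting OK or FAIL.
# Usage: repeat_run COUNT LOGDIR CMD [ARGS...]
# The body runs in a subshell so its variables do not leak to the caller.
repeat_run() (
    count=$1
    logdir=$2
    shift 2
    mkdir -p "$logdir"
    i=1
    while [ "$i" -le "$count" ]; do
        log="$logdir/run_$i.log"
        # Capture stdout and stderr of this run in its own log file.
        "$@" > "$log" 2>&1
        status=$?
        # Flag the run on a bad exit status or the CUDA error text from the thread.
        if [ "$status" -ne 0 ] || grep -q 'illegal memory access' "$log"; then
            echo "$(uname -n) run $i FAIL (exit $status)"
        else
            echo "$(uname -n) run $i OK"
        fi
        i=$((i + 1))
    done
)
```

Called as "repeat_run 10 logs" followed by the full mpiexec command line
from the job script, this prints one line per run, which makes it easy to
see whether the first few runs on a node pair pass and later ones fail.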

thanks,
scott

On Thu, Mar 23, 2017 at 07:18:41PM -0400, Ross Walker wrote:
> That's not good. Those energies should all be identical. If this is for single GPU runs, this suggests something is very wrong with your GPUs. Or something is messed up with the Intel compilers. What happens if you use the GNU compilers?
>
> > On Mar 23, 2017, at 18:58, Scott Brozell <sbrozell.rci.rutgers.edu> wrote:
> >
> > Hi,
> >
> > This and/or related variables were already being set, and explicit
> > testing does not show any impact by MV2_ENABLE_AFFINITY.
> >
> > The Amber build used intel/16.0.3 and mvapich2/2.2.
> > I have attached the outputs of the GPU_Validation_Test.
> > There are 4 distinct energies for the small test, but this is
> > considered passing for the Intel compilers, correct?
> >
> > cut -c 17-31 GPU_0.log | rmwhitespace | sort | uniq -c
> > r0214gpu
> > 2 -58242.4056
> > 6 -58225.0889
> > 9 -58223.9416
> > 3 -58215.9390
> > r0218gpu
> > 4 -58242.4056
> > 7 -58225.0889
> > 3 -58223.9416
> > 6 -58215.9390
> >
> > thanks,
> > scott
> >
> > On Tue, Mar 21, 2017 at 01:50:00PM -0700, Niel Henriksen wrote:
> >> This is a shot in the dark (lots of variables with cluster
> >> hardware/software) ....
> >>
> >> It looks like you're using mvapich2. I had problems running pmemd.cuda.MPI
> >> jobs without setting the following environment variable:
> >>
> >> export MV2_ENABLE_AFFINITY=0
> >> mpiexec.hydra -f $PBS_NODEFILE -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O ...
> >>
> >> On Tue, Mar 21, 2017 at 12:52 PM, Scott Brozell <sbrozell.rci.rutgers.edu>
> >> wrote:
> >>> On a cluster where 20 nodes have 1 NVIDIA Tesla K40
> >>> https://www.osc.edu/resources/technical_support/supercomputers/ruby/technical_specifications
> >>> repeated runs of a 2 node JAC9999 benchmark show this behavior:
> >>> the first couple (1, 2, or 3) of jobs on a specific node pair work
> >>> and most subsequent (but in temporal proximity) jobs on that pair fail.
> >>>
> >>> Jobs usually stop after the 3000 to 6000 nstep printout. The errors
> >>> involve illegal memory access, e.g:
> >>> cudaMemcpy GpuBuffer::Download failed an illegal memory access was encountered
> >>> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
> >>>
> >>> Initial sys admin testing of the hardware doesn't find any issues.
> >>> Repeated single node pmemd.cuda jac9999 benchmarks and other jobs
> >>> show no problems.
> >>>
> >>> Has anyone else seen behavior like this ?
> >>>
> >>> Script and two example outputs are attached.
> >>>
> >>> thanks,
> >>> scott
> >>>
> >>> ===
> >>> ./update_amber --show-applied-patches
> >>> AmberTools 16 Applied Patches:
> >>> ------------------------------
> >>> update.1, update.2, update.3, update.4, update.5, update.6, update.7,
> >>> update.8, update.9, update.10,
> >>> update.11, update.12, update.13, update.14, update.15, update.16,
> >>> update.17, update.18, update.19, update.20,
> >>> update.21
> >>>
> >>> Amber 16 Applied Patches:
> >>> -------------------------
> >>> update.1 (modifies pmemd, pmemd.cuda, pmemd.cuda.MPI)
> >>> update.2 (modifies pmemd.cuda.MPI)
> >>> update.3 (modifies pmemd)
> >>> update.4 (modifies pmemd)
> >>> update.5 (modifies pmemd.cuda)
> >>> update.6 (modifies pmemd.cuda)
> >>> update.7 (modifies pmemd.cuda)
> >>> ===
> >>> ===
> >>> short md, nve ensemble
> >>> &cntrl
> >>> ntx=7, irest=1,
> >>> ntc=2, ntf=2, tol=0.0000001,
> >>> nstlim=9999,
> >>> ntpr=1000, ntwr=10000,
> >>> dt=0.001,
> >>> cut=9.,
> >>> ntt=0, temp0=300.,
> >>> &end
> >>> &ewald
> >>> nfft1=64,nfft2=64,nfft3=64,
> >>> skinnb=2.,
> >>> &end
> >>> ===
> > <r0214GPU_0.log><r0218GPU_0.log><r0214GPU.large_0.log><r0218GPU.large_0.log>

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Apr 24 2017 - 13:30:03 PDT