Re: [AMBER] multinode pmemd.cuda.MPI jac9999 behavior

From: Scott Brozell <sbrozell.rci.rutgers.edu>
Date: Tue, 25 Apr 2017 11:44:50 -0400

Hi,

On 1.:
Perhaps it was not clear, but I showed both the small and large
test results. In other words, Intel gives 4 distinct final energies
for the small test and 2 for the large test; GNU gives 1 energy
for each.
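
For reference, a minimal sketch of how such a tally can be reproduced
with standard tools (this assumes the per-GPU validation logs are
named GPU_*.log and that each run leaves one "Etot = ..." summary
line in its log; adjust the glob to the actual file names):

  # count how many runs produced each distinct final-energy line
  grep -h "Etot" GPU_*.log | sort | uniq -c | sort -rn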

There have also been multiple experiments over the whole cluster,
all yielding exactly the same energies.

This certainly points to good GPUs and some issue with the Intel
compilers. Perhaps it is time to contact our Intel colleagues
if we have no explanation.

thanks,
scott

On Tue, Apr 25, 2017 at 09:09:14AM -0400, Daniel Roe wrote:
>
> My experience with the GPU validation test is that with Intel
> compilers I usually end up with final energies flipping between two
> different values. With GNU compilers and the same GPUs I only get one
> energy each time. This is why I only use GNU compilers for the CUDA
> stuff. If there is more variation than that (i.e., more than 2 values
> with Intel or more than 1 with GNU), that indicates a "bad" GPU.
>
> If you get different values with GNU 6.3 but not other versions of GNU
> compilers, that's something I haven't seen before.
>
> -Dan
>
> On Mon, Apr 24, 2017 at 4:26 PM, Scott Brozell <sbrozell.rci.rutgers.edu> wrote:
> > Hi,
> >
> > 1. GPU_Validation_Test
> >
> > The differing energies are due to the Intel compilers.
> > Here are results for the whole cluster:
> >
> > grep -h Etot `f. GPU_*g` | cut -d':' -f2 | sort | uniq -c
> > 105 Etot = -58215.9390 EKtot = 14443.1777 EPtot = -72659.1168
> > 81 Etot = -58223.9416 EKtot = 14337.9287 EPtot = -72561.8704
> > 104 Etot = -58225.0889 EKtot = 14418.4785 EPtot = -72643.5674
> > 110 Etot = -58242.4056 EKtot = 14469.3867 EPtot = -72711.7923
> >
> > 196 Etot = -2709235.2967 EKtot = 662445.1250 EPtot = -3371680.4217
> > 204 Etot = -2710562.0191 EKtot = 661308.3125 EPtot = -3371870.3316
> >
> > vs. GNU 6.3.0:
> >
> > 400 Etot = -58221.5446 EKtot = 14382.0596 EPtot = -72603.6041
> >
> > 400 Etot = -2710123.9790 EKtot = 662223.0000 EPtot = -3372346.9790
> >
> > What is the explanation for this?
> >
> >
> > 2. Sporadic failures of pmemd.cuda.MPI during repeated runs of a 2-node JAC9999 benchmark.
> >
> > The only news here is that it also happens with a GNU 6.3.0-built pmemd.cuda.MPI.
> >
> > Is this the only report of such issues?
> >
> > thanks,
> > scott
> >
> > On Thu, Mar 23, 2017 at 07:18:41PM -0400, Ross Walker wrote:
> >> That's not good. Those energies should all be identical. If this is for single-CPU runs, this suggests something very wrong with your GPUs, or something is messed up with the Intel compilers. What happens if you use GNU compilers?
> >>
> >> > On Mar 23, 2017, at 18:58, Scott Brozell <sbrozell.rci.rutgers.edu> wrote:
> >> >
> >> > Hi,
> >> >
> >> > This and/or related variables were already being set, and explicit
> >> > testing shows no impact from MV2_ENABLE_AFFINITY.
> >> >
> >> > The amber build used intel/16.0.3 and mvapich2/2.2.
> >> > I have attached the outputs of the GPU_Validation_Test.
> >> > There are 4 distinct energies for the small test, but this is
> >> > considered passing for the Intel compilers, correct?
> >> >
> >> > cut -c 17-31 GPU_0.log | rmwhitespace | sort | uniq -c
> >> > r0214gpu
> >> > 2 -58242.4056
> >> > 6 -58225.0889
> >> > 9 -58223.9416
> >> > 3 -58215.9390
> >> > r0218gpu
> >> > 4 -58242.4056
> >> > 7 -58225.0889
> >> > 3 -58223.9416
> >> > 6 -58215.9390
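> >> >
> >> > For anyone without the local rmwhitespace helper, a roughly
> >> > equivalent tally with standard tools would be (the 17-31 column
> >> > range for the Etot field is taken from the command above):
> >> >
> >> >   # strip the padding blanks, then count distinct values
> >> >   cut -c 17-31 GPU_0.log | tr -d ' ' | sort | uniq -c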
> >> >
> >> > thanks,
> >> > scott
> >> >
> >> > On Tue, Mar 21, 2017 at 01:50:00PM -0700, Niel Henriksen wrote:
> >> >> This is a shot in the dark (lots of variables with cluster
> >> >> hardware/software) ....
> >> >>
> >> >> It looks like you're using mvapich2. I had problems running pmemd.cuda.MPI
> >> >> jobs without setting the following environment variable:
> >> >>
> >> >> export MV2_ENABLE_AFFINITY=0
> >> >> mpiexec.hydra -f $PBS_NODEFILE -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O ...
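> >> >>
> >> >> The same setting can also be passed through the launcher itself; a
> >> >> sketch using Hydra's -genv option (which exports an environment
> >> >> variable to every rank) would be:
> >> >>
> >> >> mpiexec.hydra -genv MV2_ENABLE_AFFINITY 0 -f $PBS_NODEFILE -np 2 \
> >> >>     $AMBERHOME/bin/pmemd.cuda.MPI -O ...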
> >> >>
> >> >> On Tue, Mar 21, 2017 at 12:52 PM, Scott Brozell <sbrozell.rci.rutgers.edu>
> >> >> wrote:
> >> >>> On a cluster where 20 nodes each have 1 NVIDIA Tesla K40
> >> >>> https://www.osc.edu/resources/technical_support/supercomputers/ruby/technical_specifications
> >> >>> repeated runs of a 2-node JAC9999 benchmark show this behavior:
> >> >>> the first few (1, 2, or 3) jobs on a specific node pair work,
> >> >>> and most subsequent jobs on that pair (run in close temporal
> >> >>> proximity) fail.
> >> >>>
> >> >>> Jobs usually stop after the 3000 to 6000 nstep printout. The errors
> >> >>> involve illegal memory access, e.g.:
> >> >>> cudaMemcpy GpuBuffer::Download failed an illegal memory access was
> >> >>> encountered
> >> >>> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was
> >> >>> encountered
> >> >>>
> >> >>> Initial sys admin testing of the hardware doesn't find any issues.
> >> >>> Repeated single node pmemd.cuda jac9999 benchmarks and other jobs
> >> >>> show no problems.
> >> >>>
> >> >>> Has anyone else seen behavior like this?
> >> >>>
> >> >>> Script and two example outputs are attached.
> >> >>>
> >> >>> thanks,
> >> >>> scott
> >> >>>
> >> >>> ===
> >> >>> ./update_amber --show-applied-patches
> >> >>> AmberTools 16 Applied Patches:
> >> >>> ------------------------------
> >> >>> update.1, update.2, update.3, update.4, update.5, update.6, update.7,
> >> >>> update.8, update.9, update.10,
> >> >>> update.11, update.12, update.13, update.14, update.15, update.16,
> >> >>> update.17, update.18, update.19, update.20,
> >> >>> update.21
> >> >>>
> >> >>> Amber 16 Applied Patches:
> >> >>> -------------------------
> >> >>> update.1 (modifies pmemd, pmemd.cuda, pmemd.cuda.MPI)
> >> >>> update.2 (modifies pmemd.cuda.MPI)
> >> >>> update.3 (modifies pmemd)
> >> >>> update.4 (modifies pmemd)
> >> >>> update.5 (modifies pmemd.cuda)
> >> >>> update.6 (modifies pmemd.cuda)
> >> >>> update.7 (modifies pmemd.cuda)
> >> >>> ===
> >> >>> ===
> >> >>> short md, nve ensemble
> >> >>> &cntrl
> >> >>> ntx=7, irest=1,
> >> >>> ntc=2, ntf=2, tol=0.0000001,
> >> >>> nstlim=9999,
> >> >>> ntpr=1000, ntwr=10000,
> >> >>> dt=0.001,
> >> >>> cut=9.,
> >> >>> ntt=0, temp0=300.,
> >> >>> &end
> >> >>> &ewald
> >> >>> nfft1=64,nfft2=64,nfft3=64,
> >> >>> skinnb=2.,
> >> >>> &end
> >> >>> ===
> >> > <r0214GPU_0.log> <r0218GPU_0.log> <r0214GPU.large_0.log> <r0218GPU.large_0.log>

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Apr 25 2017 - 09:00:02 PDT