Re: [AMBER] multinode pmemd.cuda.MPI jac9999 behavior from Daniel Roe on 2017-04-25 (Amber Archive Apr 2017)

From: Daniel Roe <daniel.r.roe.gmail.com>
Date: Tue, 25 Apr 2017 09:09:14 -0400

Hi,

My experience with the GPU validation test is that with Intel
compilers I usually end up with final energies flipping between two
different values. With GNU compilers and the same GPUs I only get one
energy each time. This is why I only use GNU compilers for the CUDA
stuff. If there is more variation than that (i.e. more than 2 values for
Intel, or more than 1 for GNU), that indicates a "bad" GPU.
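
As a rough illustration of that rule of thumb, here is a minimal Python
sketch that tallies the distinct Etot values found in validation output
lines and compares the count against the per-compiler limit. The names
tally_etot, looks_suspect, and MAX_DISTINCT are hypothetical, and the
limits are only the ones described above, not anything shipped with the
Amber GPU validation test.

```python
import re
from collections import Counter

# Matches the Etot field of an energy line such as
# " Etot = -58221.5446 EKtot = 14382.0596 EPtot = -72603.6041".
# The word boundary keeps it from matching inside other field names.
ETOT_RE = re.compile(r"\bEtot\s*=\s*(-?\d+\.\d+)")

# Assumed limits, taken from the rule of thumb in this message:
# up to 2 distinct final energies for Intel builds, 1 for GNU builds.
MAX_DISTINCT = {"intel": 2, "gnu": 1}


def tally_etot(lines):
    """Count how often each distinct Etot value (kept as text) appears."""
    counts = Counter()
    for line in lines:
        match = ETOT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def looks_suspect(counts, compiler):
    """Return True if there are more distinct energies than the limit allows."""
    return len(counts) > MAX_DISTINCT[compiler]


if __name__ == "__main__":
    # One GNU-style result repeated, as in the gnu 6.3.0 numbers quoted below.
    gnu_lines = [" Etot = -58221.5446 EKtot = 14382.0596 EPtot = -72603.6041"] * 3
    gnu_counts = tally_etot(gnu_lines)
    print(gnu_counts, looks_suspect(gnu_counts, "gnu"))
```

The energies are compared as printed strings rather than floats, so two
runs only count as the same value if they agree to every printed digit.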

If you get different values with GNU 6.3 but not other versions of GNU
compilers, that's something I haven't seen before.

-Dan

On Mon, Apr 24, 2017 at 4:26 PM, Scott Brozell <sbrozell.rci.rutgers.edu> wrote:
> Hi,
>
> 1. GPU_Validation_Test
>
> The differing energies are due to the intel compilers.
> Here are results for the whole cluster:
>
> grep -h Etot `f. GPU_*g` | cut -d':' -f2 | sort | uniq -c
> 105 Etot = -58215.9390 EKtot = 14443.1777 EPtot = -72659.1168
> 81 Etot = -58223.9416 EKtot = 14337.9287 EPtot = -72561.8704
> 104 Etot = -58225.0889 EKtot = 14418.4785 EPtot = -72643.5674
> 110 Etot = -58242.4056 EKtot = 14469.3867 EPtot = -72711.7923
>
> 196 Etot = -2709235.2967 EKtot = 662445.1250 EPtot = -3371680.4217
> 204 Etot = -2710562.0191 EKtot = 661308.3125 EPtot = -3371870.3316
>
> VS gnu 6.3.0
>
> 400 Etot = -58221.5446 EKtot = 14382.0596 EPtot = -72603.6041
>
> 400 Etot = -2710123.9790 EKtot = 662223.0000 EPtot = -3372346.9790
>
> What is the explanation for this ?
>
>
> 2. Sporadic failures of pmemd.cuda.MPI using repeated runs of a 2 node JAC9999 benchmark.
>
> The only news here is that it also happens with a gnu 6.3.0 built pmemd.cuda.MPI.
>
> Is this the only report of such issues ?
>
> thanks,
> scott
>
> On Thu, Mar 23, 2017 at 07:18:41PM -0400, Ross Walker wrote:
>> That's not good. Those energies should all be identical. If this is for single CPU runs this suggests something very wrong with your GPUs. Or something is messed up with the intel compilers. What happens if you use gnu compilers?
>>
>> > On Mar 23, 2017, at 18:58, Scott Brozell <sbrozell.rci.rutgers.edu> wrote:
>> >
>> > Hi,
>> >
>> > This and/or related variables were already being set, and explicit
>> > testing does not show any impact by MV2_ENABLE_AFFINITY.
>> >
>> > The amber build used intel/16.0.3 and mvapich2/2.2.
>> > I have attached the outputs of the GPU_Validation_Test.
>> > There are 4 energies for the small test, but this is considered
>> > passing for the intel compilers, correct ?
>> >
>> > cut -c 17-31 GPU_0.log | rmwhitespace | sort | uniq -c
>> > r0214gpu
>> > 2 -58242.4056
>> > 6 -58225.0889
>> > 9 -58223.9416
>> > 3 -58215.9390
>> > r0218gpu
>> > 4 -58242.4056
>> > 7 -58225.0889
>> > 3 -58223.9416
>> > 6 -58215.9390
>> >
>> > thanks,
>> > scott
>> >
>> > On Tue, Mar 21, 2017 at 01:50:00PM -0700, Niel Henriksen wrote:
>> >> This is a shot in the dark (lots of variables with cluster
>> >> hardware/software) ....
>> >>
>> >> It looks like you're using mvapich2. I had problems running pmemd.cuda.MPI
>> >> jobs without setting the following environmental variable:
>> >>
>> >> export MV2_ENABLE_AFFINITY=0
>> >> mpiexec.hydra -f $PBS_NODEFILE -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O ...
>> >>
>> >> On Tue, Mar 21, 2017 at 12:52 PM, Scott Brozell <sbrozell.rci.rutgers.edu> wrote:
>> >>> On a cluster where 20 nodes have 1 NVIDIA Tesla K40
>> >>> https://www.osc.edu/resources/technical_support/supercomputers/ruby/technical_specifications
>> >>> repeated runs of a 2 node JAC9999 benchmark show this behavior:
>> >>> the first couple (1, 2, or 3) of jobs on a specific node pair work
>> >>> and most subsequent (but in temporal proximity) jobs on that pair fail.
>> >>>
>> >>> Jobs usually stop after the 3000 to 6000 nstep printout. The errors
>> >>> involve illegal memory access, e.g:
>> >>> cudaMemcpy GpuBuffer::Download failed an illegal memory access was encountered
>> >>> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
>> >>>
>> >>> Initial sys admin testing of the hardware doesn't find any issues.
>> >>> Repeated single node pmemd.cuda jac9999 benchmarks and other jobs
>> >>> show no problems.
>> >>>
>> >>> Has anyone else seen behavior like this ?
>> >>>
>> >>> Script and two example outputs are attached.
>> >>>
>> >>> thanks,
>> >>> scott
>> >>>
>> >>> ===
>> >>> ./update_amber --show-applied-patches
>> >>> AmberTools 16 Applied Patches:
>> >>> ------------------------------
>> >>> update.1, update.2, update.3, update.4, update.5, update.6, update.7,
>> >>> update.8, update.9, update.10,
>> >>> update.11, update.12, update.13, update.14, update.15, update.16,
>> >>> update.17, update.18, update.19, update.20,
>> >>> update.21
>> >>>
>> >>> Amber 16 Applied Patches:
>> >>> -------------------------
>> >>> update.1 (modifies pmemd, pmemd.cuda, pmemd.cuda.MPI)
>> >>> update.2 (modifies pmemd.cuda.MPI)
>> >>> update.3 (modifies pmemd)
>> >>> update.4 (modifies pmemd)
>> >>> update.5 (modifies pmemd.cuda)
>> >>> update.6 (modifies pmemd.cuda)
>> >>> update.7 (modifies pmemd.cuda)
>> >>> ===
>> >>> ===
>> >>> short md, nve ensemble
>> >>> &cntrl
>> >>> ntx=7, irest=1,
>> >>> ntc=2, ntf=2, tol=0.0000001,
>> >>> nstlim=9999,
>> >>> ntpr=1000, ntwr=10000,
>> >>> dt=0.001,
>> >>> cut=9.,
>> >>> ntt=0, temp0=300.,
>> >>> &end
>> >>> &ewald
>> >>> nfft1=64,nfft2=64,nfft3=64,
>> >>> skinnb=2.,
>> >>> &end
>> >>> ===
>> > <r0214GPU_0.log><r0218GPU_0.log><r0214GPU.large_0.log><r0218GPU.large_0.log>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber

-- 
-------------------------
Daniel R. Roe
Laboratory of Computational Biology
National Institutes of Health, NHLBI
5635 Fishers Ln, Rm T900
Rockville MD, 20852
https://www.lobos.nih.gov/lcb

Received on Tue Apr 25 2017 - 06:30:02 PDT