Re: [AMBER] multinode pmemd.cuda.MPI jac9999 behavior

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 23 Mar 2017 19:18:41 -0400

That's not good. Those energies should all be identical. If these are single-GPU runs, this suggests something very wrong with your GPUs, or something is messed up with the Intel compilers. What happens if you use GNU compilers?
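A minimal sketch of that rebuild, assuming a gcc-built MVAPICH2 and the CUDA toolkit are already in your environment:

    cd $AMBERHOME
    make clean
    ./configure -cuda -mpi gnu    # gnu in place of intel
    make install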

> On Mar 23, 2017, at 18:58, Scott Brozell <sbrozell.rci.rutgers.edu> wrote:
>
> Hi,
>
> This and related variables were already being set, and explicit
> testing shows no impact from MV2_ENABLE_AFFINITY.
>
> The amber build used intel/16.0.3 and mvapich2/2.2.
> I have attached the outputs of the GPU_Validation_Test.
> There are 4 energies for the small test, but this is considered
> passing for the Intel compilers, correct?
>
> cut -c 17-31 GPU_0.log | rmwhitespace | sort | uniq -c
> r0214gpu
> 2 -58242.4056
> 6 -58225.0889
> 9 -58223.9416
> 3 -58215.9390
> r0218gpu
> 4 -58242.4056
> 7 -58225.0889
> 3 -58223.9416
> 6 -58215.9390
>
> thanks,
> scott
>
> On Tue, Mar 21, 2017 at 01:50:00PM -0700, Niel Henriksen wrote:
>> This is a shot in the dark (lots of variables with cluster
>> hardware/software) ...
>>
>> It looks like you're using mvapich2. I had problems running pmemd.cuda.MPI
>> jobs without setting the following environment variable:
>>
>> export MV2_ENABLE_AFFINITY=0
>> mpiexec.hydra -f $PBS_NODEFILE -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O ...
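For context, MV2_ENABLE_AFFINITY controls MVAPICH2's CPU binding; the comments below are an added explanation, not part of the original mail:

    # With the default MV2_ENABLE_AFFINITY=1, MVAPICH2 pins each MPI rank,
    # and any threads it spawns, to a single core; setting it to 0 disables
    # that binding so the CUDA runtime's helper threads are not confined
    # to the rank's core.
    export MV2_ENABLE_AFFINITY=0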
>>
>> On Tue, Mar 21, 2017 at 12:52 PM, Scott Brozell <sbrozell.rci.rutgers.edu>
>> wrote:
>>> On a cluster where 20 nodes each have one NVIDIA Tesla K40
>>> (https://www.osc.edu/resources/technical_support/supercomputers/ruby/technical_specifications),
>>> repeated runs of a 2-node JAC9999 benchmark show this behavior:
>>> the first few (1, 2, or 3) jobs on a specific node pair work,
>>> and most subsequent jobs on that pair (run in close succession) fail.
>>>
>>> Jobs usually stop after the 3000 to 6000 nstep printout. The errors
>>> involve illegal memory access, e.g.:
>>> cudaMemcpy GpuBuffer::Download failed an illegal memory access was encountered
>>> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
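A sketch of the kind of back-to-back loop that reproduces this, with placeholder names for the JAC input files (the attached script presumably does something similar):

    for i in 1 2 3 4 5 6; do
        mpiexec.hydra -f $PBS_NODEFILE -np 2 $AMBERHOME/bin/pmemd.cuda.MPI \
            -O -i mdin -o mdout.$i -p prmtop -c inpcrd -r restrt.$i || break
    done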
>>>
>>> Initial sys admin testing of the hardware doesn't find any issues.
>>> Repeated single-node pmemd.cuda JAC9999 benchmarks and other jobs
>>> show no problems.
>>>
>>> Has anyone else seen behavior like this?
>>>
>>> Script and two example outputs are attached.
>>>
>>> thanks,
>>> scott
>>>
>>> ===
>>> ./update_amber --show-applied-patches
>>> AmberTools 16 Applied Patches:
>>> ------------------------------
>>> update.1, update.2, update.3, update.4, update.5, update.6, update.7,
>>> update.8, update.9, update.10,
>>> update.11, update.12, update.13, update.14, update.15, update.16,
>>> update.17, update.18, update.19, update.20,
>>> update.21
>>>
>>> Amber 16 Applied Patches:
>>> -------------------------
>>> update.1 (modifies pmemd, pmemd.cuda, pmemd.cuda.MPI)
>>> update.2 (modifies pmemd.cuda.MPI)
>>> update.3 (modifies pmemd)
>>> update.4 (modifies pmemd)
>>> update.5 (modifies pmemd.cuda)
>>> update.6 (modifies pmemd.cuda)
>>> update.7 (modifies pmemd.cuda)
>>> ===
>>> ===
>>> short md, nve ensemble
>>> &cntrl
>>> ntx=7, irest=1,
>>> ntc=2, ntf=2, tol=0.0000001,
>>> nstlim=9999,
>>> ntpr=1000, ntwr=10000,
>>> dt=0.001,
>>> cut=9.,
>>> ntt=0, temp0=300.,
>>> &end
>>> &ewald
>>> nfft1=64,nfft2=64,nfft3=64,
>>> skinnb=2.,
>>> &end
>>> ===
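An annotated copy of that input for readers less familiar with mdin files; the comments are added here ('!' starts a comment in Fortran namelist input):

    short md, nve ensemble
     &cntrl
      ntx=7, irest=1,               ! restart, reading coordinates and velocities
      ntc=2, ntf=2, tol=0.0000001,  ! SHAKE constraints on bonds to hydrogen
      nstlim=9999,                  ! 9999 MD steps
      ntpr=1000, ntwr=10000,        ! energy print and restart-write intervals
      dt=0.001,                     ! 1 fs timestep
      cut=9.,                       ! 9 Angstrom direct-space cutoff
      ntt=0, temp0=300.,            ! no thermostat: NVE, so Etot should be conserved
     &end
     &ewald
      nfft1=64,nfft2=64,nfft3=64,   ! PME charge grid dimensions
      skinnb=2.,                    ! pairlist skin in Angstroms
     &end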
> <r0214GPU_0.log> <r0218GPU_0.log> <r0214GPU.large_0.log> <r0218GPU.large_0.log>


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Mar 23 2017 - 16:30:02 PDT