[AMBER] replica exchange with GPU

From: <bbalta.itu.edu.tr>
Date: Fri, 28 Feb 2020 15:14:59 +0300

We are trying to run a replica exchange simulation using
pmemd.cuda.MPI in Amber 16. Each machine in the cluster we use has
2 x NVIDIA Tesla P100 GPUs, 384 GB of ECC 2600 MHz RAM, and 2 x 20-core
Intel Xeon Scalable 6148 CPUs.
We tried different numbers of replicas. The groupfile (for initial
equilibration) and the Slurm job submission script we used for 10 replicas
are as follows:
-O -rem 0 -i eq.in.001 -o eq.out.001 -c baslakcross344.7K_5thns_eq.incrd -r eq.rst.001 -x eq.mdcrd.001 -inf eq.mdinfo.001 -p 1-388_stripped.top
-O -rem 0 -i eq.in.002 -o eq.out.002 -c baslakcross344.7K_100thns_eq.incrd -r eq.rst.002 -x eq.mdcrd.002 -inf eq.mdinfo.002 -p 1-388_stripped.top
-O -rem 0 -i eq.in.003 -o eq.out.003 -c baslakcross344.7K_180thns_eq.incrd -r eq.rst.003 -x eq.mdcrd.003 -inf eq.mdinfo.003 -p 1-388_stripped.top
-O -rem 0 -i eq.in.004 -o eq.out.004 -c baslakcross344.7K_225thns_eq.incrd -r eq.rst.004 -x eq.mdcrd.004 -inf eq.mdinfo.004 -p 1-388_stripped.top
-O -rem 0 -i eq.in.005 -o eq.out.005 -c baslakcross344.7K_495thns_eq.incrd -r eq.rst.005 -x eq.mdcrd.005 -inf eq.mdinfo.005 -p 1-388_stripped.top
-O -rem 0 -i eq.in.006 -o eq.out.006 -c baslakcross344.7K_650thns_eq.incrd -r eq.rst.006 -x eq.mdcrd.006 -inf eq.mdinfo.006 -p 1-388_stripped.top
-O -rem 0 -i eq.in.007 -o eq.out.007 -c baslakcross344.7K_c0_eq.incrd -r eq.rst.007 -x eq.mdcrd.007 -inf eq.mdinfo.007 -p 1-388_stripped.top
-O -rem 0 -i eq.in.008 -o eq.out.008 -c baslakcross344.7K_c1_eq.incrd -r eq.rst.008 -x eq.mdcrd.008 -inf eq.mdinfo.008 -p 1-388_stripped.top
-O -rem 0 -i eq.in.009 -o eq.out.009 -c baslakcross344.7K_c2_eq.incrd -r eq.rst.009 -x eq.mdcrd.009 -inf eq.mdinfo.009 -p 1-388_stripped.top
-O -rem 0 -i eq.in.010 -o eq.out.010 -c baslakcross344.7K_c3_eq.incrd -r eq.rst.010 -x eq.mdcrd.010 -inf eq.mdinfo.010 -p 1-388_stripped.top
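
For reference, a groupfile with this layout (one line per replica, zero-padded index, a different starting coordinate file for each replica) could be written by a short script instead of by hand. This is only a sketch: the function name make_groupfile is hypothetical, and the file names are simply those listed in the groupfile above.

```shell
#!/bin/bash
# Sketch: print one groupfile line per replica.
# File names are copied from the groupfile above; make_groupfile is a
# hypothetical helper name, not part of Amber.
coords=(
  baslakcross344.7K_5thns_eq.incrd
  baslakcross344.7K_100thns_eq.incrd
  baslakcross344.7K_180thns_eq.incrd
  baslakcross344.7K_225thns_eq.incrd
  baslakcross344.7K_495thns_eq.incrd
  baslakcross344.7K_650thns_eq.incrd
  baslakcross344.7K_c0_eq.incrd
  baslakcross344.7K_c1_eq.incrd
  baslakcross344.7K_c2_eq.incrd
  baslakcross344.7K_c3_eq.incrd
)
top=1-388_stripped.top

make_groupfile() {
  local i=1 c idx
  for c in "${coords[@]}"; do
    idx=$(printf '%03d' "$i")   # zero-padded replica index: 001, 002, ...
    echo "-O -rem 0 -i eq.in.$idx -o eq.out.$idx -c $c -r eq.rst.$idx -x eq.mdcrd.$idx -inf eq.mdinfo.$idx -p $top"
    i=$((i + 1))
  done
}

# Print to stdout; redirect to a file to create the groupfile.
make_groupfile
```

Redirecting the output, for example `make_groupfile > deneme_eq.groupfile`, would produce the file passed to -groupfile in the job script.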

#SBATCH -p barbun-cuda
#SBATCH -A uucar
#SBATCH -J 1-388remd201
#SBATCH -n 20
#SBATCH --gres=gpu:2
#SBATCH --time=1-00:00:00
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err
module load centos7.3/lib/openmpi/1.8.8-gcc-4.8.5
module load centos7.3/lib/cuda/9.0
mpirun $AMBER_DIR/bin/pmemd.cuda.MPI -ng 10 -groupfile deneme_eq.groupfile
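
To make the resource layout of this job explicit, the arithmetic implied by the script can be sketched as follows. It assumes that the MPI ranks are divided evenly among the replica groups and that all ranks share the GPUs of a single node; both are assumptions for illustration, and the function name rank_layout is hypothetical. The calculation only restates the numbers in the script and does not by itself show the cause of the failure.

```shell
#!/bin/bash
# Sketch: report how many MPI ranks each replica group and each GPU would
# get, assuming an even split (an assumption, see the text above).
rank_layout() {
  local ntasks=$1 ngroups=$2 ngpus=$3
  if (( ntasks % ngroups != 0 )); then
    echo "ranks ($ntasks) not divisible by groups ($ngroups)"
    return 1
  fi
  # Integer division for ranks per group; ceiling division for ranks per GPU.
  echo "ranks_per_group=$(( ntasks / ngroups )) ranks_per_gpu=$(( (ntasks + ngpus - 1) / ngpus ))"
}

# Values from the job script: -n 20, -ng 10, --gres=gpu:2
rank_layout 20 10 2
```

With the values above, each replica group gets 2 ranks and each of the 2 GPUs is shared by 10 ranks.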

Amber produced an output file but the output terminated after writing
the result of the 0th step:
  NSTEP = 0 TIME(PS) = 0.000 TEMP(K) = 0.00 PRESS = 0.0
  Etot = -134342.6908 EKtot = 0.0000 EPtot =
  BOND = 19102.2338 ANGLE = 3092.7537 DIHED =
  1-4 NB = 1347.0833 1-4 EEL = 17091.3523 VDWAALS =
  EELEC = -196328.6342 EHBOND = 0.0000 RESTRAINT =

The error message given by the system is:
  Running multipmemd version of pmemd Amber16
     Total processors = 20
     Number of groups = 10
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

   Process name: [[13854,1],12]
   Exit code: 255

We also tried 40 replicas, different numbers of CPUs (up to 80), 3
GPUs on 2 different machines, and a single GPU with pmemd.cuda (without
MPI), but none of these worked. With 40 replicas and 80 CPUs we
obtained the following error in addition to the ones mentioned above:
cudaMalloc GpuBuffer::Allocate failed out of memory

It seems from the manual and the mailing list that Amber 16 supports REMD
on GPUs, so we are probably making a mistake at some point. Any help will be
greatly appreciated.
Thank you.

AMBER mailing list
Received on Fri Feb 28 2020 - 04:30:02 PST