[AMBER] replica exchange with GPU

From: <bbalta.itu.edu.tr>
Date: Fri, 28 Feb 2020 15:14:59 +0300

Hello,
We are trying to run a replica exchange simulation with pmemd.cuda.MPI
in Amber 16. The cluster we use is composed of machines with the
following properties: 2 x NVIDIA Tesla P100 GPUs, 384 GB ECC 2600 MHz
RAM, and 2 x 20-core Intel Xeon Scalable 6148 CPUs.
We have tried different numbers of replicas. The groupfile (for the
initial equilibration) and the Slurm job submission script we used for
10 replicas are as follows:
-O -rem 0 -i eq.in.001 -o eq.out.001 -c baslakcross344.7K_5thns_eq.incrd -r eq.rst.001 -x eq.mdcrd.001 -inf eq.mdinfo.001 -p 1-388_stripped.top
-O -rem 0 -i eq.in.002 -o eq.out.002 -c baslakcross344.7K_100thns_eq.incrd -r eq.rst.002 -x eq.mdcrd.002 -inf eq.mdinfo.002 -p 1-388_stripped.top
-O -rem 0 -i eq.in.003 -o eq.out.003 -c baslakcross344.7K_180thns_eq.incrd -r eq.rst.003 -x eq.mdcrd.003 -inf eq.mdinfo.003 -p 1-388_stripped.top
-O -rem 0 -i eq.in.004 -o eq.out.004 -c baslakcross344.7K_225thns_eq.incrd -r eq.rst.004 -x eq.mdcrd.004 -inf eq.mdinfo.004 -p 1-388_stripped.top
-O -rem 0 -i eq.in.005 -o eq.out.005 -c baslakcross344.7K_495thns_eq.incrd -r eq.rst.005 -x eq.mdcrd.005 -inf eq.mdinfo.005 -p 1-388_stripped.top
-O -rem 0 -i eq.in.006 -o eq.out.006 -c baslakcross344.7K_650thns_eq.incrd -r eq.rst.006 -x eq.mdcrd.006 -inf eq.mdinfo.006 -p 1-388_stripped.top
-O -rem 0 -i eq.in.007 -o eq.out.007 -c baslakcross344.7K_c0_eq.incrd -r eq.rst.007 -x eq.mdcrd.007 -inf eq.mdinfo.007 -p 1-388_stripped.top
-O -rem 0 -i eq.in.008 -o eq.out.008 -c baslakcross344.7K_c1_eq.incrd -r eq.rst.008 -x eq.mdcrd.008 -inf eq.mdinfo.008 -p 1-388_stripped.top
-O -rem 0 -i eq.in.009 -o eq.out.009 -c baslakcross344.7K_c2_eq.incrd -r eq.rst.009 -x eq.mdcrd.009 -inf eq.mdinfo.009 -p 1-388_stripped.top
-O -rem 0 -i eq.in.010 -o eq.out.010 -c baslakcross344.7K_c3_eq.incrd -r eq.rst.010 -x eq.mdcrd.010 -inf eq.mdinfo.010 -p 1-388_stripped.top

#!/bin/bash
#SBATCH -p barbun-cuda
#SBATCH -A uucar
#SBATCH -J 1-388remd201
#SBATCH -N 1
#SBATCH -n 20
#SBATCH --gres=gpu:2
#SBATCH --time=1-00:00:00
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err
#SBATCH -D /truba_scratch/uucar/hsp70_amber/REMD_NUCLEOTIDE_FREE_201PR/NEW_TEMP/1-388
export WORKDIR=/truba_scratch/uucar/hsp70_amber/REMD_NUCLEOTIDE_FREE_201PR/NEW_TEMP/1-388
cd $WORKDIR
module load centos7.3/lib/openmpi/1.8.8-gcc-4.8.5
module load centos7.3/lib/cuda/9.0
echo "SLURM_NODELIST $SLURM_NODELIST"
echo "NUMBER OF CORES $SLURM_NTASKS"
mpirun $AMBER_DIR/bin/pmemd.cuda.MPI -ng 10 -groupfile deneme_eq.groupfile
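
(In case it is relevant: the GPU environment the job actually sees can
be confirmed by adding a few standard, non-Amber checks just before
the mpirun line, for example:

echo "CUDA_VISIBLE_DEVICES = $CUDA_VISIBLE_DEVICES"    # GPUs Slurm exposes to the job
nvidia-smi -L                                          # physical GPUs on the node
nvidia-smi --query-gpu=name,memory.total --format=csv  # GPU model and device memory
)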

Amber produced an output file, but it terminated after writing the
energies for step 0:
 NSTEP =        0   TIME(PS) =       0.000  TEMP(K) =     0.00  PRESS =     0.0
 Etot   =   -134342.6908  EKtot   =         0.0000  EPtot      =   -134342.6908
 BOND   =     19102.2338  ANGLE   =      3092.7537  DIHED      =      4080.3001
 1-4 NB =      1347.0833  1-4 EEL =     17091.3523  VDWAALS    =     17272.2203
 EELEC  =   -196328.6342  EHBOND  =         0.0000  RESTRAINT  =         0.0000
------------------------------------------------------------------------------

The error message given by the system is:
SLURM_NODELIST barbun122
NUMBER OF CORES 20
  Running multipmemd version of pmemd Amber16
     Total processors = 20
     Number of groups = 10
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

   Process name: [[13854,1],12]
   Exit code: 255


We also tried 40 replicas, different numbers of CPUs (up to 80), 3
GPUs on 2 different machines, and a single GPU with pmemd.cuda
(without MPI), but none of these worked. With 40 replicas and 80 CPUs
we obtained the following error in addition to the ones above:
cudaMalloc GpuBuffer::Allocate failed out of memory
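
(While such a run starts up, the per-GPU memory use could be watched
with standard nvidia-smi polling, again nothing Amber-specific, just a
sketch, to see how far the allocations get before the failure:

nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5
)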


It seems from the manual and the mailing list that Amber 16 supports
REMD on GPUs, so we are probably making a mistake somewhere. Any help
would be greatly appreciated.
Thank you.




_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Feb 28 2020 - 04:30:02 PST