Re: [AMBER] replica exchange with GPU

From: Cruzeiro, Vinicius Wilian D <vwcruzeiro.ufl.edu>
Date: Mon, 2 Mar 2020 16:45:44 +0000

Hello,

pmemd was designed to use one replica per GPU. If you attempt to run two or more replicas per GPU, you will see either very significant slowdowns or memory issues (as you are seeing).

I advise you to run with one replica per GPU. It should work.
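
For example, with the 10 replicas you describe on nodes that have 2 GPUs each, you would request 5 nodes so that the total number of GPUs equals the number of replicas. A minimal sketch of the SLURM preamble (assuming such a cluster; partition, account, and module-load lines omitted and to be adapted to your site):

#!/bin/bash
#SBATCH --nodes=5             # 5 nodes x 2 GPUs = 10 GPUs for 10 replicas
#SBATCH --gres=gpu:2          # GPUs per node
#SBATCH --ntasks-per-node=2   # one MPI rank (one replica) per GPU
#SBATCH --cpus-per-task=1

NG=`cat remd.groupfile | wc -l`   # number of replicas; should equal total GPUs
mpirun -np $NG pmemd.cuda.MPI -ng $NG -groupfile remd.groupfile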

I hope this helps,
Best,


Vinícius Wilian D Cruzeiro

PhD Candidate
Department of Chemistry, Physical Chemistry Division
University of Florida, United States

Voice: +1 (352) 846-1633

On Mar 2, 2020, at 2:44 AM, Alessandro Contini <alessandro.contini.unimi.it> wrote:


Hi,
we had a similar problem while setting up a GPU cluster running SLURM
as the scheduler.
After configuring SLURM properly, we solved our REMD issue by using
this script:

#!/bin/bash
#SBATCH --job-name tremd_amber
#SBATCH --nodes=2
#SBATCH --gres=gpu:2 # number of GPUs per node
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=6 # number of replicas in groupfile / number of nodes
#SBATCH --output job.out
#SBATCH --error job.err
# start from launch dir
cd $SLURM_SUBMIT_DIR

# load amber module + mpi environment
source /amber18/amber18.sh

# Calculate the number of replicas
NG=`cat remd.groupfile | wc -l`

# do remd (no backbone restraints production run)
mpirun --mca btl_tcp_if_include 10.0.0.0/24 -np $NG pmemd.cuda.MPI -ng $NG -groupfile remd.groupfile
wait
exit

The "--mca btl_tcp_if_include 10.0.0.0/24" is needed for mpirun to work
properly on our cluster and should be updated accordingly to your network
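
If you are unsure which network to pass, one way (assuming a Linux node with iproute2) is to list the IPv4 addresses of the node's interfaces and pick the private subnet that connects the compute nodes:

ip -4 -brief addr show

Open MPI also accepts interface names instead of subnets, e.g. "--mca btl_tcp_if_include eth0" (with eth0 a placeholder for your actual interface name).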

Hope this helps,

Best regards

Alessandro



On 02/03/2020 10:21, bbalta.itu.edu.tr wrote:
Yes. The same calculation runs successfully on CPUs, and standard MD
with one of these replicas also runs successfully. So it seems that
the problem is not with our mdin, topology, and groupfile files or
with the CUDA installation. The problem arises only when we try to run
REMD on GPUs.





Quoting Carlos Simmerling <carlos.simmerling.gmail.com>:

does this work if you test it on CPUs only instead of GPUs? It might help
you check whether all of the components of your script, groupfile, and mdin
are correct.
also, have you run standard GPU MD using these same inputs (no REMD, just a
single MD run from one of these prmtop/inpcrd files with the same mdin)?

On Fri, Feb 28, 2020 at 7:15 AM <bbalta.itu.edu.tr> wrote:

Hello,
We are trying to run a replica exchange simulation using
pmemd.cuda.MPI in Amber 16. The cluster we use is composed of machines
with the following properties: 2x NVIDIA Tesla P100 GPUs, 384 GB
ECC 2600 MHz RAM, and 2x 20-core Intel Xeon Scalable 6148 CPUs.
We tried different numbers of replicas. The groupfile (for initial
equilibration) and SLURM job submission script we used for 10 replicas
are as follows:
-O -rem 0 -i eq.in.001 -o eq.out.001 -c baslakcross344.7K_5thns_eq.incrd -r eq.rst.001 -x eq.mdcrd.001 -inf eq.mdinfo.001 -p 1-388_stripped.top
-O -rem 0 -i eq.in.002 -o eq.out.002 -c baslakcross344.7K_100thns_eq.incrd -r eq.rst.002 -x eq.mdcrd.002 -inf eq.mdinfo.002 -p 1-388_stripped.top
-O -rem 0 -i eq.in.003 -o eq.out.003 -c baslakcross344.7K_180thns_eq.incrd -r eq.rst.003 -x eq.mdcrd.003 -inf eq.mdinfo.003 -p 1-388_stripped.top
-O -rem 0 -i eq.in.004 -o eq.out.004 -c baslakcross344.7K_225thns_eq.incrd -r eq.rst.004 -x eq.mdcrd.004 -inf eq.mdinfo.004 -p 1-388_stripped.top
-O -rem 0 -i eq.in.005 -o eq.out.005 -c baslakcross344.7K_495thns_eq.incrd -r eq.rst.005 -x eq.mdcrd.005 -inf eq.mdinfo.005 -p 1-388_stripped.top
-O -rem 0 -i eq.in.006 -o eq.out.006 -c baslakcross344.7K_650thns_eq.incrd -r eq.rst.006 -x eq.mdcrd.006 -inf eq.mdinfo.006 -p 1-388_stripped.top
-O -rem 0 -i eq.in.007 -o eq.out.007 -c baslakcross344.7K_c0_eq.incrd -r eq.rst.007 -x eq.mdcrd.007 -inf eq.mdinfo.007 -p 1-388_stripped.top
-O -rem 0 -i eq.in.008 -o eq.out.008 -c baslakcross344.7K_c1_eq.incrd -r eq.rst.008 -x eq.mdcrd.008 -inf eq.mdinfo.008 -p 1-388_stripped.top
-O -rem 0 -i eq.in.009 -o eq.out.009 -c baslakcross344.7K_c2_eq.incrd -r eq.rst.009 -x eq.mdcrd.009 -inf eq.mdinfo.009 -p 1-388_stripped.top
-O -rem 0 -i eq.in.010 -o eq.out.010 -c baslakcross344.7K_c3_eq.incrd -r eq.rst.010 -x eq.mdcrd.010 -inf eq.mdinfo.010 -p 1-388_stripped.top

#!/bin/bash
#SBATCH -p barbun-cuda
#SBATCH -A uucar
#SBATCH -J 1-388remd201
#SBATCH -N 1
#SBATCH -n 20
#SBATCH --gres=gpu:2
#SBATCH --time=1-00:00:00
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err
#SBATCH -D /truba_scratch/uucar/hsp70_amber/REMD_NUCLEOTIDE_FREE_201PR/NEW_TEMP/1-388
export WORKDIR=/truba_scratch/uucar/hsp70_amber/REMD_NUCLEOTIDE_FREE_201PR/NEW_TEMP/1-388
cd $WORKDIR
module load centos7.3/lib/openmpi/1.8.8-gcc-4.8.5
module load centos7.3/lib/cuda/9.0
echo "SLURM_NODELIST $SLURM_NODELIST"
echo "NUMBER OF CORES $SLURM_NTASKS"
mpirun $AMBER_DIR/bin/pmemd.cuda.MPI -ng 10 -groupfile deneme_eq.groupfile

Amber produced an output file, but the output terminated after writing
the results of step 0:
 NSTEP =        0  TIME(PS) =       0.000  TEMP(K) =     0.00  PRESS =      0.0
 Etot   =   -134342.6908  EKtot   =         0.0000  EPtot      =   -134342.6908
 BOND   =     19102.2338  ANGLE   =      3092.7537  DIHED      =      4080.3001
 1-4 NB =      1347.0833  1-4 EEL =     17091.3523  VDWAALS    =     17272.2203
 EELEC  =   -196328.6342  EHBOND  =         0.0000  RESTRAINT  =         0.0000


------------------------------------------------------------------------------

The error message given by the system is:
SLURM_NODELIST barbun122
NUMBER OF CORES 20
  Running multipmemd version of pmemd Amber16
     Total processors = 20
     Number of groups = 10
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

   Process name: [[13854,1],12]
   Exit code: 255


We also tried 40 replicas, different numbers of CPUs (up to 80), 3
GPUs on 2 different machines, and a single GPU with pmemd.cuda
(without MPI), but none of these worked. With 40 replicas and 80 CPUs
we obtained the following error in addition to the ones mentioned above:
cudaMalloc GpuBuffer::Allocate failed out of memory


It seems from the manual and the mailing list that Amber 16 supports
REMD on GPUs, so we are probably making a mistake at some point. Any
help would be greatly appreciated.
Thank you.





--
Prof. Alessandro Contini, PhD
Dipartimento di Scienze Farmaceutiche
Sezione di Chimica Generale e Organica "A. Marchesini"
Via Venezian, 21 (edificio 5. corpo A, III piano) 20133 Milano
tel. +390250314480
e-mail alessandro.contini.unimi.it
skype alessandrocontini

http://www.scopus.com/authid/detail.url?authorId=7003441091
http://orcid.org/0000-0002-4394-8956
http://www.researcherid.com/rid/F-5064-2012
https://loop.frontiersin.org/people/487422


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Mar 02 2020 - 09:00:02 PST