Hi,
we had a similar problem while setting up a GPU cluster running SLURM
as the scheduler.
After properly configuring SLURM, we solved our issue with REMD by using
the following script:
#!/bin/bash
#SBATCH --job-name tremd_amber
#SBATCH --nodes=2
#SBATCH --gres=gpu:2          # number of GPUs per node
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=6   # replicas in groupfile / number of nodes
#SBATCH --output job.out
#SBATCH --error job.err

# start from the launch directory
cd $SLURM_SUBMIT_DIR

# load the Amber module + MPI environment
source /amber18/amber18.sh

# calculate the number of replicas
NG=$(wc -l < remd.groupfile)

# run REMD (production run, no backbone restraints)
mpirun --mca btl_tcp_if_include 10.0.0.0/24 -np $NG pmemd.cuda.MPI -ng $NG -groupfile remd.groupfile

wait
exit
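As a side note, a small pre-flight check along these lines (just a sketch, not part of our production script; the factor 6 simply mirrors the --ntasks-per-node value above) can catch the common case where the groupfile lists more replicas than SLURM has allocated MPI tasks for:

# Optional sanity check (sketch): compare the replicas in the groupfile
# with the MPI tasks granted by SLURM (nodes x ntasks-per-node).
NG=$(wc -l < remd.groupfile)
NTASKS=$(( SLURM_JOB_NUM_NODES * 6 ))
if [ "$NG" -gt "$NTASKS" ]; then
    echo "remd.groupfile lists $NG replicas but SLURM allocated only $NTASKS tasks" >&2
    exit 1
fi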
The "--mca btl_tcp_if_include 10.0.0.0/24" is needed for mpirun to work
properly on our cluster and should be updated accordingly to your network
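If you are unsure which value fits your cluster, one way to find it (a sketch; the interface names and addresses shown are only illustrative) is to list the IPv4 interfaces on a compute node and pick the subnet, or the interface name, that the nodes use to reach each other:

# List IPv4 interfaces and their addresses on a compute node.
ip -4 -o addr show | awk '{print $2, $4}'
# Illustrative output:
#   lo    127.0.0.1/8
#   ib0   10.0.0.15/24   -> pass "--mca btl_tcp_if_include 10.0.0.0/24" (or just "ib0")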
Hope this helps,
Best regards
Alessandro
On 02/03/2020 10:21, bbalta.itu.edu.tr wrote:
> Yes. The same calculation runs successfully on CPUs, and standard MD
> with one of these replicas also runs successfully. So it seems that
> the problem is not with our mdin, topology, or groupfile files, or with
> the CUDA installation. The problem arises only when we try to run REMD on GPUs.
>
> Quoting Carlos Simmerling <carlos.simmerling.gmail.com>:
>
>> Does this work if you test it on CPUs only instead of GPUs? It might help
>> you check whether all of the components of your script, groupfile, and mdin
>> are correct.
>> Also, have you run standard GPU MD using these same inputs (no REMD, just a
>> single MD run from one of these prmtop/inpcrd files with the same mdin)?
>>
>> On Fri, Feb 28, 2020 at 7:15 AM <bbalta.itu.edu.tr> wrote:
>>
>>> Hello,
>>> We are trying to run a replica exchange simulation using
>>> pmemd.cuda.MPI in Amber 16. The cluster we use is composed of machines
>>> with the following properties: 2 x NVIDIA Tesla P100 GPUs, 384 GB
>>> ECC 2600 MHz RAM, and 2 x 20-core Intel Xeon Scalable 6148 CPUs.
>>> We tried different numbers of replicas. The groupfile (for initial
>>> equilibration) and SLURM job submission script we used for 10 replicas
>>> are as follows:
>>> -O -rem 0 -i eq.in.001 -o eq.out.001 -c baslakcross344.7K_5thns_eq.incrd -r eq.rst.001 -x eq.mdcrd.001 -inf eq.mdinfo.001 -p 1-388_stripped.top
>>> -O -rem 0 -i eq.in.002 -o eq.out.002 -c baslakcross344.7K_100thns_eq.incrd -r eq.rst.002 -x eq.mdcrd.002 -inf eq.mdinfo.002 -p 1-388_stripped.top
>>> -O -rem 0 -i eq.in.003 -o eq.out.003 -c baslakcross344.7K_180thns_eq.incrd -r eq.rst.003 -x eq.mdcrd.003 -inf eq.mdinfo.003 -p 1-388_stripped.top
>>> -O -rem 0 -i eq.in.004 -o eq.out.004 -c baslakcross344.7K_225thns_eq.incrd -r eq.rst.004 -x eq.mdcrd.004 -inf eq.mdinfo.004 -p 1-388_stripped.top
>>> -O -rem 0 -i eq.in.005 -o eq.out.005 -c baslakcross344.7K_495thns_eq.incrd -r eq.rst.005 -x eq.mdcrd.005 -inf eq.mdinfo.005 -p 1-388_stripped.top
>>> -O -rem 0 -i eq.in.006 -o eq.out.006 -c baslakcross344.7K_650thns_eq.incrd -r eq.rst.006 -x eq.mdcrd.006 -inf eq.mdinfo.006 -p 1-388_stripped.top
>>> -O -rem 0 -i eq.in.007 -o eq.out.007 -c baslakcross344.7K_c0_eq.incrd -r eq.rst.007 -x eq.mdcrd.007 -inf eq.mdinfo.007 -p 1-388_stripped.top
>>> -O -rem 0 -i eq.in.008 -o eq.out.008 -c baslakcross344.7K_c1_eq.incrd -r eq.rst.008 -x eq.mdcrd.008 -inf eq.mdinfo.008 -p 1-388_stripped.top
>>> -O -rem 0 -i eq.in.009 -o eq.out.009 -c baslakcross344.7K_c2_eq.incrd -r eq.rst.009 -x eq.mdcrd.009 -inf eq.mdinfo.009 -p 1-388_stripped.top
>>> -O -rem 0 -i eq.in.010 -o eq.out.010 -c baslakcross344.7K_c3_eq.incrd -r eq.rst.010 -x eq.mdcrd.010 -inf eq.mdinfo.010 -p 1-388_stripped.top
>>>
>>> #!/bin/bash
>>> #SBATCH -p barbun-cuda
>>> #SBATCH -A uucar
>>> #SBATCH -J 1-388remd201
>>> #SBATCH -N 1
>>> #SBATCH -n 20
>>> #SBATCH --gres=gpu:2
>>> #SBATCH --time=1-00:00:00
>>> #SBATCH --output=slurm-%j.out
>>> #SBATCH --error=slurm-%j.err
>>> #SBATCH -D /truba_scratch/uucar/hsp70_amber/REMD_NUCLEOTIDE_FREE_201PR/NEW_TEMP/1-388
>>> export WORKDIR=/truba_scratch/uucar/hsp70_amber/REMD_NUCLEOTIDE_FREE_201PR/NEW_TEMP/1-388
>>> cd $WORKDIR
>>> module load centos7.3/lib/openmpi/1.8.8-gcc-4.8.5
>>> module load centos7.3/lib/cuda/9.0
>>> echo "SLURM_NODELIST $SLURM_NODELIST"
>>> echo "NUMBER OF CORES $SLURM_NTASKS"
>>> mpirun $AMBER_DIR/bin/pmemd.cuda.MPI -ng 10 -groupfile deneme_eq.groupfile
>>>
>>> Amber produced an output file, but the output terminated after writing
>>> the results of step 0:
>>>  NSTEP =        0   TIME(PS) =       0.000  TEMP(K) =     0.00  PRESS =     0.0
>>>  Etot   =   -134342.6908  EKtot   =         0.0000  EPtot      =   -134342.6908
>>>  BOND   =     19102.2338  ANGLE   =      3092.7537  DIHED      =      4080.3001
>>>  1-4 NB =      1347.0833  1-4 EEL =     17091.3523  VDWAALS    =     17272.2203
>>>  EELEC  =   -196328.6342  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>>>  ------------------------------------------------------------------------------
>>>
>>> The error message given by the system is:
>>> SLURM_NODELIST barbun122
>>> NUMBER OF CORES 20
>>> Running multipmemd version of pmemd Amber16
>>> Total processors = 20
>>> Number of groups = 10
>>> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
>>> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
>>> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
>>> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
>>> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
>>> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
>>> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
>>> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
>>> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
>>> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
>>> -------------------------------------------------------
>>> Primary job terminated normally, but 1 process returned
>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun detected that one or more processes exited with non-zero status,
>>> thus causing the job to be terminated. The first process to do so was:
>>>
>>>   Process name: [[13854,1],12]
>>>   Exit code:    255
>>>
>>>
>>> We also tried 40 replicas, different numbers of CPUs (up to 80), 3
>>> GPUs on 2 different machines, and a single GPU with pmemd.cuda (without
>>> MPI), but none of them worked. With 40 replicas and 80 CPUs we
>>> obtained the following error in addition to the ones mentioned above:
>>> cudaMalloc GpuBuffer::Allocate failed out of memory
>>>
>>>
>>> It seems from the manual and the mailing list that Amber 16 supports REMD
>>> on GPUs, so we are probably making a mistake at some point. Any help will
>>> be greatly appreciated.
>>> Thank you.
>>>
>>>
>>>
>>>
--
Prof. Alessandro Contini, PhD
Dipartimento di Scienze Farmaceutiche
Sezione di Chimica Generale e Organica "A. Marchesini"
Via Venezian, 21 (edificio 5. corpo A, III piano) 20133 Milano
tel. +390250314480
e-mail alessandro.contini.unimi.it
skype alessandrocontini
http://www.scopus.com/authid/detail.url?authorId=7003441091
http://orcid.org/0000-0002-4394-8956
http://www.researcherid.com/rid/F-5064-2012
https://loop.frontiersin.org/people/487422
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber