Yes. The same calculation runs successfully on CPUs, and standard MD with one
of these replicas also runs successfully on a GPU. So the problem does not seem
to be with our mdin, topology, or groupfile files, or with the CUDA
installation; it arises only when we try to run REMD on GPUs.
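
For reference, the standard-MD check used the same mdin and prmtop with one of
the replica starting structures, roughly as sketched below (the test output
names here are arbitrary); the nvidia-smi line is simply one way to inspect the
GPU configuration of the node, in case the compute mode matters:

  # plain single-GPU MD with the same inputs, no REMD -- this completes normally
  $AMBER_DIR/bin/pmemd.cuda -O -i eq.in.001 -o test_md.out \
      -c baslakcross344.7K_5thns_eq.incrd -r test_md.rst \
      -x test_md.mdcrd -p 1-388_stripped.top

  # device count and compute mode of the node's two P100s
  nvidia-smi --query-gpu=index,name,compute_mode --format=csv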
Quoting Carlos Simmerling <carlos.simmerling.gmail.com>:
> Does this work if you test it on CPUs only instead of GPUs? It might help
> you check whether all of the components of your script, groupfile and mdin
> are correct.
> Also, have you run GPU standard MD using these same inputs (no REMD, just a
> single MD from one of these prmtop/inpcrd files with the same mdin)?
>
> On Fri, Feb 28, 2020 at 7:15 AM <bbalta.itu.edu.tr> wrote:
>
>> Hello,
>> We are trying to run a replica exchange simulation using
>> pmemd.cuda.MPI in Amber 16. The cluster we use is composed of machines
>> with the following properties: 2 x NVIDIA Tesla P100 GPUs, 384 GB
>> ECC 2600 MHz RAM, and 2 x 20-core Intel Xeon Scalable 6148 CPUs.
>> We tried different numbers of replicas. The groupfile (for the initial
>> equilibration) and the Slurm job submission script we used for 10 replicas
>> are as follows:
>> -O -rem 0 -i eq.in.001 -o eq.out.001 -c baslakcross344.7K_5thns_eq.incrd -r eq.rst.001 -x eq.mdcrd.001 -inf eq.mdinfo.001 -p 1-388_stripped.top
>> -O -rem 0 -i eq.in.002 -o eq.out.002 -c baslakcross344.7K_100thns_eq.incrd -r eq.rst.002 -x eq.mdcrd.002 -inf eq.mdinfo.002 -p 1-388_stripped.top
>> -O -rem 0 -i eq.in.003 -o eq.out.003 -c baslakcross344.7K_180thns_eq.incrd -r eq.rst.003 -x eq.mdcrd.003 -inf eq.mdinfo.003 -p 1-388_stripped.top
>> -O -rem 0 -i eq.in.004 -o eq.out.004 -c baslakcross344.7K_225thns_eq.incrd -r eq.rst.004 -x eq.mdcrd.004 -inf eq.mdinfo.004 -p 1-388_stripped.top
>> -O -rem 0 -i eq.in.005 -o eq.out.005 -c baslakcross344.7K_495thns_eq.incrd -r eq.rst.005 -x eq.mdcrd.005 -inf eq.mdinfo.005 -p 1-388_stripped.top
>> -O -rem 0 -i eq.in.006 -o eq.out.006 -c baslakcross344.7K_650thns_eq.incrd -r eq.rst.006 -x eq.mdcrd.006 -inf eq.mdinfo.006 -p 1-388_stripped.top
>> -O -rem 0 -i eq.in.007 -o eq.out.007 -c baslakcross344.7K_c0_eq.incrd -r eq.rst.007 -x eq.mdcrd.007 -inf eq.mdinfo.007 -p 1-388_stripped.top
>> -O -rem 0 -i eq.in.008 -o eq.out.008 -c baslakcross344.7K_c1_eq.incrd -r eq.rst.008 -x eq.mdcrd.008 -inf eq.mdinfo.008 -p 1-388_stripped.top
>> -O -rem 0 -i eq.in.009 -o eq.out.009 -c baslakcross344.7K_c2_eq.incrd -r eq.rst.009 -x eq.mdcrd.009 -inf eq.mdinfo.009 -p 1-388_stripped.top
>> -O -rem 0 -i eq.in.010 -o eq.out.010 -c baslakcross344.7K_c3_eq.incrd -r eq.rst.010 -x eq.mdcrd.010 -inf eq.mdinfo.010 -p 1-388_stripped.top
>>
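>> The entries above all follow the same pattern; as a sketch, assuming the
>> starting structures are listed one per line in a file called incrd.list (a
>> name made up here), the groupfile can also be generated with a short loop:
>>
>>   n=0
>>   while read crd; do
>>     n=$((n+1)); i=$(printf "%03d" $n)
>>     echo "-O -rem 0 -i eq.in.$i -o eq.out.$i -c $crd -r eq.rst.$i -x eq.mdcrd.$i -inf eq.mdinfo.$i -p 1-388_stripped.top"
>>   done < incrd.list > deneme_eq.groupfile
>>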
>> #!/bin/bash
>> #SBATCH -p barbun-cuda
>> #SBATCH -A uucar
>> #SBATCH -J 1-388remd201
>> #SBATCH -N 1
>> #SBATCH -n 20
>> #SBATCH --gres=gpu:2
>> #SBATCH --time=1-00:00:00
>> #SBATCH --output=slurm-%j.out
>> #SBATCH --error=slurm-%j.err
>> #SBATCH -D /truba_scratch/uucar/hsp70_amber/REMD_NUCLEOTIDE_FREE_201PR/NEW_TEMP/1-388
>> export WORKDIR=/truba_scratch/uucar/hsp70_amber/REMD_NUCLEOTIDE_FREE_201PR/NEW_TEMP/1-388
>> cd $WORKDIR
>> module load centos7.3/lib/openmpi/1.8.8-gcc-4.8.5
>> module load centos7.3/lib/cuda/9.0
>> echo "SLURM_NODELIST $SLURM_NODELIST"
>> echo "NUMBER OF CORES $SLURM_NTASKS"
>> mpirun $AMBER_DIR/bin/pmemd.cuda.MPI -ng 10 -groupfile deneme_eq.groupfile
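>>
>> For debugging, a couple of extra lines in the script could confirm which
>> GPUs the job actually sees before mpirun starts (standard commands, just a
>> sketch):
>>
>>   echo "CUDA_VISIBLE_DEVICES = $CUDA_VISIBLE_DEVICES"
>>   nvidia-smi -L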
>>
>> Amber produced an output file, but the output terminated after writing the
>> results for step 0:
>>  NSTEP =        0   TIME(PS) =       0.000  TEMP(K) =     0.00  PRESS =     0.0
>>  Etot   =   -134342.6908  EKtot   =         0.0000  EPtot      =   -134342.6908
>>  BOND   =     19102.2338  ANGLE   =      3092.7537  DIHED      =      4080.3001
>>  1-4 NB =      1347.0833  1-4 EEL =     17091.3523  VDWAALS    =     17272.2203
>>  EELEC  =   -196328.6342  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>>
>>
>> ------------------------------------------------------------------------------
>>
>> The error message given by the system is:
>> SLURM_NODELIST barbun122
>> NUMBER OF CORES 20
>> Running multipmemd version of pmemd Amber16
>> Total processors = 20
>> Number of groups = 10
>> gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered
>> (the line above was printed 10 times, once per replica)
>> -------------------------------------------------------
>> Primary job terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun detected that one or more processes exited with non-zero status,
>> thus causing the job to be terminated. The first process to do so was:
>>
>>   Process name: [[13854,1],12]
>>   Exit code:    255
>>
>>
>> We also tried 40 replicas, different numbers of CPUs (up to 80), 3 GPUs
>> across 2 different machines, and a single GPU with pmemd.cuda (without
>> MPI), but none of these worked. With 40 replicas and 80 CPUs we obtained
>> the following error in addition to the ones mentioned above:
>> cudaMalloc GpuBuffer::Allocate failed out of memory
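>>
>> With 40 replicas on only 2-3 GPUs, several replicas have to share one card,
>> and each replica presumably allocates its own copy of the system there, so
>> running out of GPU memory may not be surprising. One way to watch this while
>> the replicas start up (a sketch):
>>
>>   # per-GPU memory use, refreshed every 5 seconds
>>   nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5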
>>
>>
>> It seems from the manual and the mailing list that Amber 16 supports REMD
>> on GPUs, so we are probably making a mistake at some point. Any help will
>> be greatly appreciated.
>> Thank you.
>>
>>
>>
>>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Mar 02 2020 - 01:30:02 PST