Re: [AMBER] Problem with GPU equilibration step [ cudaMalloc GpuBuffer::Allocate failed out of memory]

From: Jason Swails <>
Date: Wed, 9 Dec 2015 07:46:00 -0500

On Wed, Dec 9, 2015 at 12:42 AM, Yogeeshwar Ajjugal <> wrote:

> Dear Amber users,
> I am trying to run an equilibration step on a GPU, but it fails with a
> cudaMalloc error. I am attaching my PBS script here. Any help would be
> appreciated.
> #!/bin/bash
> #PBS -l nodes=1:ppn=16:GPU
> #PBS -l walltime=07:00:00:00
> #PBS -q GPUq
> #PBS -e err_""$PBS_JOBID
> #PBS -o out_""$PBS_JOBID
> #PBS -r n
> #PBS -V
> #PBS -M
> export OMP_NUM_THREADS=2
> echo PBS JOB id is $PBS_JOBID
> echo NPROCS is $NPROCS
> NRINGS=`cat $PBS_NODEFILE |sort|uniq|wc -l`
> echo NRINGS is $NRINGS
> NGPUS=`expr $NRINGS \* 2`
> echo NGPUS is $NGPUS
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
> setenv AMBERHOME /home/external/iith/yajjugal/programs/amber_gpu/amber12
> mpirun -machinefile $PBS_NODEFILE -np 16
> /home/external/iith/yajjugal/programs/amber_gpu/amber12/bin/pmemd.cuda -O
> -i step1.inp -o step1.out -r step1.rst -p ../input/prmtop
> -c ../input/prmcrd -ref ../input/prmcrd
> mpirun -machinefile $PBS_NODEFILE -np 16
> /home/external/iith/yajjugal/programs/amber_gpu/amber12/bin/pmemd.cuda -O
> -i step2.inp -o step2.out -r step2.rst -p ../input/prmtop
> -c step1.rst -ref step1.rst

There are several things wrong here. First, pmemd.cuda is a serial
program, not a parallel one. This script runs 16 copies of the same job
on the GPUs available on the assigned compute node (based on your script,
it would seem each node has 2 GPUs). This is bad. The output files from
each copy will try to overwrite those from the other copies, and the
copies will compete for resources such as GPU memory (the likely source
of the cudaMalloc failure). pmemd.cuda, like all serial programs, is
meant to be used *without* mpirun (or, if the cluster requires that
mpirun be used, via "mpirun -np 1" to force only a single process).
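
The serial invocation described above can be sketched as follows. This is
a minimal sketch, not a tested job script: the file names are taken from
the original commands, while the /path/to/amber12 default, the DRY_RUN
switch and the run_step helper are assumptions added for illustration. It
also uses bash-style variable assignment, since setenv is csh syntax and
does not exist in bash.

```shell
#!/bin/bash
# Sketch: run each equilibration step as ONE pmemd.cuda process, no mpirun.
# /path/to/amber12 is a placeholder; point AMBERHOME at your installation.
AMBERHOME="${AMBERHOME:-/path/to/amber12}"
# Hypothetical safety switch: by default the commands are only printed.
# Set DRY_RUN=0 to actually launch them.
DRY_RUN="${DRY_RUN:-1}"

run_step() {
    # $1 = step name (step1, step2, ...)
    # $2 = starting coordinates, also used as the restraint reference
    local step="$1" start="$2"
    local cmd=("$AMBERHOME/bin/pmemd.cuda" -O
               -i "$step.inp" -o "$step.out" -r "$step.rst"
               -p ../input/prmtop -c "$start" -ref "$start")
    if [ "$DRY_RUN" = "1" ]; then
        echo "${cmd[@]}"
    else
        "${cmd[@]}"
    fi
}

# Step 2 starts from the restart file written by step 1,
# so it only runs if step 1 finished successfully.
run_step step1 ../input/prmcrd && run_step step2 step1.rst
```

Because only one process is started per step, nothing else writes to
step1.out or step1.rst while the step is running.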

Second, pmemd.cuda.MPI (which is the parallel version of pmemd.cuda)
parallelizes across GPUs -- it does not use the CPUs for much computing.
As a result, you should use -np # where # is the number of GPUs you are
requesting, NOT the number of CPUs. Even if you had used pmemd.cuda.MPI in
your commands above, it would try to use 16 GPUs. If your node only has 2,
it will run 8 of the processes on each GPU, which will ruin performance.
That said, I do not think the parallel performance of pmemd.cuda.MPI in
Amber 12 is very impressive (Amber 14 is, I believe, substantially better
due to peer-to-peer support). You might be better off running independent
serial simulations, even if the parallel simulations work.
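
The two options above can be sketched as follows. This is a minimal
sketch under stated assumptions, not a tested job script: NGPUS=2 comes
from the original script's two GPUs per node, and the /path/to/amber12
default, the DRY_RUN switch, the launch helper and the run0/run1 file
names are hypothetical. CUDA_VISIBLE_DEVICES is the standard CUDA
environment variable that restricts which GPUs a process can see.

```shell
#!/bin/bash
# Sketch: size the launch by GPU count, not by CPU core count.
NGPUS="${NGPUS:-2}"                          # assumption: 2 GPUs on the node
AMBERHOME="${AMBERHOME:-/path/to/amber12}"   # placeholder path
DRY_RUN="${DRY_RUN:-1}"                      # 1 = print commands, 0 = run them

launch() {
    if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi
}

# Option A: one parallel job with one MPI process per GPU.
run_parallel() {
    launch mpirun -np "$NGPUS" "$AMBERHOME/bin/pmemd.cuda.MPI" -O \
        -i step1.inp -o step1.out -r step1.rst \
        -p ../input/prmtop -c ../input/prmcrd -ref ../input/prmcrd
}

# Option B: independent serial jobs, each restricted to its own GPU.
# run0.inp, run1.inp, ... are hypothetical per-simulation input files;
# every job writes its own output and restart files.
run_independent() {
    local gpu
    for ((gpu = 0; gpu < NGPUS; gpu++)); do
        (
            # Exported inside the subshell, so only this job is affected.
            export CUDA_VISIBLE_DEVICES="$gpu"
            launch "$AMBERHOME/bin/pmemd.cuda" -O \
                -i "run$gpu.inp" -o "run$gpu.out" -r "run$gpu.rst" \
                -p ../input/prmtop -c ../input/prmcrd -ref ../input/prmcrd
        ) &
    done
    wait   # block until every background job has finished
}

# Pick one of the two:
run_parallel
# run_independent
```

In dry-run mode the CUDA_VISIBLE_DEVICES setting is not echoed; it only
takes effect when the commands are actually launched.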


Jason M. Swails
Rutgers University
Postdoctoral Researcher
AMBER mailing list
Received on Wed Dec 09 2015 - 05:00:03 PST