[AMBER] problem with pmemd.cuda and PBS

From: Indrajit Deb <biky2004indra.gmail.com>
Date: Mon, 11 Jan 2016 14:52:32 +0100

Dear All,

I am submitting several pmemd.cuda jobs through PBS at a supercomputing
facility. The facility has several nodes, each with two Tesla M2050 GPUs. I
am using the following PBS script, following the instructions given by the
administrator:

#!/bin/bash
#PBS -q standard
#PBS -l nodes=1:ppn=1:gpu2
#PBS -l mem=62gb
#PBS -l walltime=48:00:00
# Set the path and load the appropriate modules
module load amber/14
# Lustre file system - shared between nodes
mkdir -p /tmp/lustre_shared/$USER/$PBS_JOBID
export TMPDIR=/tmp/lustre_shared/$USER/$PBS_JOBID
# Move to the WORKING DIRECTORY
cd $PBS_O_WORKDIR
# Copy the input data to $TMPDIR
cp prodrun2.in $TMPDIR
cp rna.prmtop $TMPDIR
cp prodrun1.rst $TMPDIR
cp DIST.rst $TMPDIR
# Move to the $TMPDIR
cd $TMPDIR
pmemd.cuda -O -i prodrun2.in -o prodrun2.out -p rna.prmtop -c prodrun1.rst \
    -r prodrun2.rst -x prodrun2.nc -ref prodrun1.rst -inf prodrun2.info
# After the calculation finishes, copy the files back to the working directory
cp -r $TMPDIR $PBS_O_WORKDIR

Now the problem is that sometimes, unfortunately, two jobs end up on the
same node and on the same GPU. The output is then 21 ns/day for each job,
whereas a single job running alone gives 42 ns/day.

I checked with the command "nvidia-smi", as shown below. Surprisingly, the
other GPU is free.

+------------------------------------------------------+
| NVIDIA-SMI 346.59     Driver Version: 346.59         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M2050         On   | 0000:02:00.0     Off |                    0 |
| N/A   N/A    P1    N/A /  N/A |      7MiB /  2687MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M2050         On   | 0000:04:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |    341MiB /  2687MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    1      1325     C  pmemd.cuda                                     166MiB |
|    1      1326     C  pmemd.cuda                                     166MiB |
+-----------------------------------------------------------------------------+
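
One idea I have, but have not tested, is to pin each job inside the PBS
script to the GPU that the scheduler assigned, instead of leaving the device
selection to the runtime. A rough sketch (it assumes our Torque installation
exports $PBS_GPUFILE with lines like "nodename-gpu0"; I am not sure it does):

# Sketch only: restrict this job to the GPU index assigned by PBS,
# so that two jobs scheduled on the same node cannot share one GPU.
if [ -n "$PBS_GPUFILE" ] && [ -f "$PBS_GPUFILE" ]; then
    gpu_id=$(sed 's/.*-gpu//' "$PBS_GPUFILE" | head -n 1)
    export CUDA_VISIBLE_DEVICES=$gpu_id
fi

Is something like this the recommended way to keep two jobs on the same node
from landing on the same GPU?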

Another problem: if we use the following PBS script (ppn=1),

#!/bin/bash
#PBS -q standard
#PBS -l nodes=1:ppn=1:gpu2
#PBS -l mem=62gb
#PBS -l walltime=48:00:00
​.............................................
.............................................
export CUDA_VISIBLE_DEVICES=0,1
mpirun -np 2 pmemd.cuda.MPI

there is an error like the following:

--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:
  pmemd.cuda.MPI

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
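
If I understand it correctly, mpirun takes its slot count from the node
allocation that PBS hands over, so with ppn=1 the node file contains a single
entry and the requested 2 slots cannot be satisfied. Roughly (the node name
below is only an illustration):

$ cat $PBS_NODEFILE
node023

Is that the right explanation, i.e. should ppn always match the number of MPI
ranks requested with -np?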


But if we use the following (ppn=2):

#!/bin/bash
#PBS -q standard
#PBS -l nodes=1:ppn=2:gpu2
#PBS -l mem=62gb
#PBS -l walltime=48:00:00
​.............................................
.............................................
export CUDA_VISIBLE_DEVICES=0,1
mpirun -np 2 pmemd.cuda.MPI ...

the jobs run, but they seem to run on the CPUs: the output is 2 ns/day. The
mdrun.out file claims that the jobs are running on the GPUs, but when we
check with "nvidia-smi", each GPU shows only about 2% utilization.


Finally, we found that:
1. The PBS script is not working properly. We are surprised that two
different jobs from the queue (from the same user or from different users)
end up on the same node and on the same GPU.

2. Jobs appear to run on the CPUs (rather than the GPUs) with the following:

#PBS -l nodes=1:ppn=2:gpu2
export CUDA_VISIBLE_DEVICES=0,1
mpirun -np 2 pmemd.cuda.MPI ...

We found that the only workaround is to log in to each node and run the jobs
directly, without any PBS script (as described at http://ambermd.org/gpus/),
as follows:

module load amber/14
export CUDA_VISIBLE_DEVICES=1   # or 0, whichever GPU is free
nohup pmemd.cuda -O -i prodrun1.in -o prodrun2.out -p rna.prmtop -c prodrun1.rst \
    -x prodrun2.nc -r prodrun2.rst -inf prodrun2.inf &
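
Before launching by hand, we check which card is idle, for example with

nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv

(the query fields are taken from the nvidia-smi help; plain "nvidia-smi"
works as well) and then set CUDA_VISIBLE_DEVICES to the free index.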


But we are actually not allowed to run jobs directly on the nodes; we have to
go through a PBS script.
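
Alternatively, would it make sense to ask the administrator to put the cards
into exclusive-process compute mode, so that a second job cannot attach to a
GPU that is already running one? As far as I understand, that would be
something like (run as root on each node):

nvidia-smi -c EXCLUSIVE_PROCESS

but I do not know whether that is appropriate on a shared cluster.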

Please help

----indrajit

--------------------------------------------------------------------------------------------------------------
*Indrajit Deb*
alternate emails: indrajitdeb81.gmail.com, idbmbg_s.caluniv.ac.in
*Present Position*
International Centre for Genetic Engineering and Biotechnology (ICGEB,
Italy) SMART Fellow,
Department of Structural Chemistry and Biology of Nucleic Acids,
Institute of Bioorganic Chemistry (IBCh),
Polish Academy of Sciences (PAS).
European Center for Bioinformatics and Genetics (ECBiG) Campus (Room: 2.6.28).
Z. Noskowskiego Str. 12/14.
Poznan 61-704, Poland.
Phone: +48616653042, Personal Mobile: +48662513522

*Previous Position*
Ph.D Student,
Department of Biophysics, Molecular Biology and Bioinformatics, University
of Calcutta (CU), 92 APC Road, Kolkata - 700009, India. Phone: +913323508386
(extn. 329, 321), Fax: +913323519755. Personal Mobile: +919239202278
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Jan 11 2016 - 06:00:04 PST