Re: [AMBER] A problem with a specific input in pmemd.cuda.MPI. 4 GPU fails, 2 GPU works fine.

From: Ross Walker <ross.rosswalker.co.uk>
Date: Fri, 04 Oct 2013 11:41:17 -0700

Hi Jeremy,

A couple of requests. First, can you check that the very latest version of
the code is being used? In the mdout file it should say:


|--------------------- INFORMATION ----------------------
| GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.
| Version 12.3.1
|
| 08/07/2013
|
| Implementation by:
| Ross C. Walker (SDSC)
| Scott Le Grand (nVIDIA)
| Duncan Poole (nVIDIA)



If the version is not 12.3.1, please run the configure script to update
AMBER to the latest version, recompile, and then try again.
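For reference, a minimal sketch of the update-and-rebuild steps, assuming a
standard Amber12/AmberTools13 tree in $AMBERHOME built with the GNU compilers;
the exact configure options depend on your compilers and MPI stack:

cd $AMBERHOME
./update_amber --update        # apply any outstanding bugfix patches
./configure -cuda gnu          # serial GPU build (pmemd.cuda)
make install
./configure -cuda -mpi gnu     # parallel GPU build (pmemd.cuda.MPI)
make install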

Next, check that it isn't a flaky GPU. Try running this on a different node
and see if you get the same failure. Alternatively, try running the 2-GPU
job on the 'other' two GPUs in the node and see if that fails, for example
along these lines:
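(A sketch only; it assumes the GPUs are numbered 0-3 and reuses the $MPIRUN,
$AMBER and $AMBER_ARGS variables from the run script quoted below.)

export CUDA_VISIBLE_DEVICES=2,3   # expose only the 'other' pair of GPUs
$MPIRUN -np 2 $AMBER $AMBER_ARGS  # same 2-GPU job, now on devices 2 and 3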

If that all looks good, please send me the inputs etc. so we can try to
replicate the failure.

Note, though, that multi-GPU runs currently do not scale very well (this
will be improved significantly in the next version of AMBER). So for now,
unless there is a really compelling reason to get as much speed as possible
out of a single simulation, a better approach is to run 4 simulations at
once, one on each GPU, with different random seeds, different starting
structures, or even completely different simulations. AMBER is designed in
such a way (unlike the other MD codes out there) that each GPU run will not
interfere with the other GPU runs, so your aggregate performance with 4 jobs
on 4 GPUs will be 4x the performance on 1 GPU.
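For example, a sketch of launching four independent serial pmemd.cuda runs,
one pinned to each GPU (the input and file names here are hypothetical; the
different random seeds would be set via ig in each input file):

for i in 0 1 2 3; do
  # bind each run to its own GPU and launch it in the background
  CUDA_VISIBLE_DEVICES=$i $AMBERHOME/bin/pmemd.cuda -O \
    -i run$i.in -p sys_box.prmtop -c run$i.rst \
    -o run$i.out -r run$i.new.rst -x run$i.nc &
done
wait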

That said, the parallel code should not crash.

All the best
Ross


On 10/4/13 11:26 AM, "Jeremy Hallum" <jhallum.umich.edu> wrote:

>Hi all,
>
>
>I have a user who is using Amber12 with AmberTools13 (completely patched
>and up to date). He is trying to run a 4-GPU job, and he is getting a
>very specific error:
>
>gpu_download_partial_forces: download failed unspecified launch failure
>
>===================================================================================
>=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>=   EXIT CODE: 255
>=   CLEANING UP REMAINING PROCESSES
>=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>===================================================================================
>
>When the user runs the same code on 2 GPUs, it runs fine. The error
>occurs whether we compile Amber against CUDA 5.0.35 or 5.5.22. The
>nodes are 16-core nodes with 64 GB of RAM and 4 GTX 580 cards. We have
>5 nodes of this type, and the error occurs on any of them, not just
>one of them.
>
>The software stack we are using is:
>
>gcc/4.6.4
>mvapich2/1.9b
>cuda 5.0.35 or cuda 5.5.22
>
>The user is running the following commands:
>
>----
>
>AMBER=$AMBERHOME/bin/pmemd.cuda.MPI
>MPIRUN=$MPI_HOME/bin/mpirun
>
>prv=01
>cur=02
>
>AMBER_ARGS="-O -i dyna.01.sander -p sys_box.prmtop -c dyna.$prv.rst
>-o dyna.$cur.out -r dyna.$cur.rst -x dyna.$cur.traj.nc -inf dyna.$cur.inf"
>
>$MPIRUN -np $NPROCS $AMBER $AMBER_ARGS
>
>----
>
>
>I can provide the inputs on request; let me know where you'd like me to
>email them. Can you give me any clues to look for to help solve the
>problem? Let me know if there are any additional pieces of information
>I can give.
>
>Thanks for any help you can give.
>
>-jeremy
>--
>Jeremy Hallum
>Computational Research Consulting Division
>Medical School Information Services
>University of Michigan
>jhallum.umich.edu
>
>_______________________________________________
>AMBER mailing list
>AMBER.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Oct 04 2013 - 12:00:05 PDT