[AMBER] A problem with a specific input in pmemd.cuda.MPI. 4 GPU fails, 2 GPU works fine.

From: Jeremy Hallum <jhallum.umich.edu>
Date: Fri, 4 Oct 2013 14:26:52 -0400

Hi all,


I have a user who is running Amber12 with AmberTools13 (fully patched and up to date). He is trying to run a 4-GPU job and gets the following error every time:

gpu_download_partial_forces: download failed unspecified launch failure

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 255
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

When the user runs the same job on 2 GPUs, it runs fine. The error occurs whether we compile Amber against CUDA 5.0.35 or 5.5.22. The nodes are 16-core nodes with 64 GB of RAM and 4 GTX 580 cards each. We have 5 nodes of this type, and the error occurs on every one of them, not just a single node.

The software stack we are using is:

gcc/4.6.4
mvapich2/1.9b
cuda 5.0.35 or cuda 5.5.22

The user is running the following commands:

----
AMBER=$AMBERHOME/bin/pmemd.cuda.MPI 
MPIRUN=$MPI_HOME/bin/mpirun
prv=01 
cur=02
AMBER_ARGS="-O -i dyna.01.sander -p sys_box.prmtop -c dyna.$prv.rst 
-o dyna.$cur.out -r dyna.$cur.rst -x dyna.$cur.traj.nc -inf dyna.$cur.inf"
$MPIRUN -np $NPROCS $AMBER $AMBER_ARGS 
----
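One sanity check that could be run ahead of the launch line above is to confirm that $NPROCS does not exceed the number of GPUs the node reports. The following is only a minimal sketch under stated assumptions: it assumes nvidia-smi is on the PATH of the compute node, and the function names count_gpus and check_ranks are hypothetical helpers, not part of Amber or the original job script. It is not known whether a rank/GPU mismatch has anything to do with the failure reported here.

```shell
#!/bin/sh
# Hypothetical pre-flight check; not part of the original job script.

# Count GPUs from `nvidia-smi -L` style output read on stdin.
# Each device is listed on a line starting with "GPU <index>:".
count_gpus() {
    grep -c '^GPU [0-9]'
}

# Succeed only if the requested rank count does not exceed the GPU count.
# $1 = number of MPI ranks (NPROCS), $2 = number of GPUs on the node.
check_ranks() {
    [ "$1" -le "$2" ]
}

# Run the check only when nvidia-smi is available and NPROCS is set.
if command -v nvidia-smi >/dev/null 2>&1 && [ -n "${NPROCS:-}" ]; then
    NGPUS=$(nvidia-smi -L | count_gpus)
    if ! check_ranks "$NPROCS" "$NGPUS"; then
        echo "NPROCS=$NPROCS exceeds the $NGPUS GPUs visible on this node" >&2
    fi
fi
```

On one of the nodes described above, the check would be expected to pass for NPROCS=2 and NPROCS=4, so it only rules out an oversubscribed launch rather than explaining the 4-GPU failure.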
I can provide the input files on request; just let me know where to email them. Are there any clues I should look for to help track down the problem? I can also supply any additional information that would be useful.
Thanks for any help you can give.
-jeremy
--
Jeremy Hallum
Computational Research Consulting Division
Medical School Information Services
University of Michigan
jhallum.umich.edu
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Oct 04 2013 - 11:30:08 PDT