Re: [AMBER] job failed for REMD in cluster

From: Ross Walker <ross.rosswalker.co.uk>
Date: Mon, 15 Oct 2012 21:06:32 +0200

Hi Albert,

The error message is fairly explicit: the value of -np must be a multiple of
-ng. Hence, if you have 11 replicas you need to run with 11, 22, 33, 44, 55,
66, etc. cores (it is best to benchmark performance at a few different
counts). If you want to run with CUDA you will need 11, 22, 33, etc. physical
GPUs (well connected). I would avoid running multiple replicas on individual
GPUs because you will likely get VERY poor performance; the same caveat
applies to CPUs, although there you can often get away with it up to a point.
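As a minimal sketch, you can sanity-check the -np / -ng relationship in your
job script before calling mpirun (the replica count and group-file name below
are taken from Albert's job; everything else is an illustrative assumption):

```shell
#!/bin/sh
# Check that the total MPI rank count (-np) divides evenly into the
# number of replica groups (-ng) before submitting.
NG=11          # number of replicas, i.e. lines in the group file (-ng)
NP=44          # requested MPI ranks (-np); the original 64 fails: 64 % 11 != 0
if [ $((NP % NG)) -ne 0 ]; then
    echo "invalid: -np $NP is not a multiple of -ng $NG" >&2
    exit 1
fi
echo "ok: $((NP / NG)) MPI ranks per replica"
# Then submit, for example:
# mpirun -np $NP "$AMBERHOME/bin/pmemd.MPI" -ng $NG -groupfile equilibrate.groupfile
```

With NG=11, any NP of 11, 22, 33, 44, ... passes the check; NP=64 is rejected,
which is exactly the "MPI size is not a multiple of -ng" condition below.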

Good luck,

Ross

On 10/15/12 8:48 PM, "Albert" <mailmd2011.gmail.com> wrote:

>hello:
>
>I am trying to submit REMD jobs in cluster under amber 12 by command:
>
>.
>.
>.
>
>mpirun -np 64 $AMBERHOME/bin/pmemd.MPI -ng 11 -groupfile
>equilibrate.groupfile
>.
>.
>.
>
>but it reported the errors below. The minimization steps ran fine.
>
>
>
>[n385:15832] [[30377,0],3]-[[30377,1],53] mca_oob_tcp_msg_recv: readv
>failed: Connection reset by peer (104)
>[n388:04879] 63 more processes have sent help message help-mpi-api.txt /
>mpi-abort
>[n388:04879] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>all help / error messages
>--------------------------------------------------------------------------
>MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>with errorcode 1.
>
>NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>You may or may not see output from other processes, depending on
>exactly when Open MPI kills them.
>--------------------------------------------------------------------------
>setup_groups: MPI size is not a multiple of -ng
>setup_groups: MPI size is not a multiple of -ng
>--------------------------------------------------------------------------
>mpirun has exited due to process rank 0 with PID 4946 on
>node n388 exiting without calling "finalize". This may
>have caused other processes in the application to be
>terminated by signals sent by mpirun (as reported here).
>--------------------------------------------------------------------------
>setup_groups: MPI size is not a multiple of -ng
>[n388:04945] 3 more processes have sent help message help-mpi-api.txt /
>mpi-abort
>[n388:04945] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>all help / error messages
>
>
>
>thank you very much
>Albert
>
>
>_______________________________________________
>AMBER mailing list
>AMBER.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber



Received on Mon Oct 15 2012 - 12:30:06 PDT