Hi,
As your error message indicates, this is a problem with the cluster
itself and not with Amber. You should send a message to your system
administrator informing them of the error. Good luck!
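That said, the Open MPI message does name two knobs you can experiment with from the command
line while you wait: btl_openib_ib_timeout (default 10) and btl_openib_ib_retry_count (already
at its maximum of 7 by default). Something along these lines raises the ACK timeout -- the
value 20 below is only an illustration, not a recommendation:

  mpirun --mca btl_openib_ib_timeout 20 -np <nprocs> pmemd.MPI <your usual arguments>

If the fabric itself is unhealthy, though, this will at best postpone the failure, so getting
the sysadmin involved is still the real fix.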
-Dan
On Mon, May 30, 2011 at 9:14 AM, Zheng, Zhong <Zhong.Zheng.stjude.org> wrote:
> Hi all
>
> I'm trying to run pmemd.mpi on a cluster. It works sometimes, but often it fails, either shortly after I start the run or, even more painfully, partway through. This is the error message I usually get:
> [[56168,1],27][btl_openib_component.c:2948:handle_wc] from node027 to: node064 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 200293576 opcode 11084 vendor error 129 qp_idx 3
> --------------------------------------------------------------------------
> The InfiniBand retry count between two MPI processes has been
> exceeded. "Retry count" is defined in the InfiniBand spec 1.2
> (section 12.7.38):
>
> The total number of times that the sender wishes the receiver to
> retry timeout, packet sequence, etc. errors before posting a
> completion error.
>
> This error typically means that there is something awry within the
> InfiniBand fabric itself. You should note the hosts on which this
> error has occurred; it has been observed that rebooting or removing a
> particular host from the job can sometimes resolve this issue.
>
> Two MCA parameters can be used to control Open MPI's behavior with
> respect to the retry count:
>
> * btl_openib_ib_retry_count - The number of times the sender will
> attempt to retry (defaulted to 7, the maximum value).
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
> to 10). The actual timeout value used is calculated as:
>
> 4.096 microseconds * (2^btl_openib_ib_timeout)
>
> See the InfiniBand spec 1.2 (section 12.7.34) for more details.
>
> Below is some information about the host that raised the error and the
> peer to which it was connected:
>
> Local host: node027
> Local device: mlx4_0
> Peer host: node064
>
> You may need to consult with your system administrator to get this
> problem fixed.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 27 with PID 30957 on
> node node027 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
>
> I suspect it might have something to do with the InfiniBand setup on the cluster. Still, is there anything I can do on my end? Thanks a lot.
>
> Agnes
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon May 30 2011 - 06:30:05 PDT