Dear amber users,
This may not be the proper list for this question, but I searched all the
archives I could find (including the mpich2 list) and found no answer. So
I would like to appeal to your experience with running MPI jobs.
As I reported before, I compiled AMBER 10 (including PMEMD) with MPICH2
(Intel compilers for both AMBER and MPICH2, no root access). I did this on
one node (named 06-01), in a local directory that is available through the
network. Everything seemed fine, and the executables (both sander.MPI and
pmemd) run nicely (the parallel performance of PMEMD is also quite good), so
I was very happy. However, in the beginning I only tested on the node I
compiled on (06-01) and on one other node (06-02).
When I tried to run on a different node (05-02), I got an error:
mpiexec_node-05-02 (mpiexec 255): no msg recvd from mpd during version check
--command used -----
${MPI_HOME}/bin/mpiexec -gdb -machinefile machines -n 4 \
${AMBERHOME}/exe/pmemd -O -i .............
-----
Trying to dissect this error, I started playing with the MPI daemons on
this node. I ran mpd and mpdtrace for diagnostics. To my surprise, mpdtrace
did not report the name of the node (as it had correctly done on 06-01
and 06-02). Instead I got "mpdtrace (mpdtrace 57): got eof on console".
The full error message (shown below) suggests a connection problem from
node-05-02 to itself. However, I can ssh (with a password) from 05-02 to
itself.
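For anyone who wants to reproduce the self-connection check outside of mpd:
judging from the traceback, mpd_sockpair fails while connecting a socket back
to the local node. The sketch below is only an approximation of that step, not
the actual mpdlib.py code. The function name can_connect_to_self and its
parameters are made up for illustration. It opens a listening socket on an
ephemeral port and then tries to connect to it through the address that the
given hostname (by default the node's own hostname) resolves to.

```python
import socket


def can_connect_to_self(host=None, timeout=5.0):
    """Listen on an ephemeral port, then connect to it via `host`.

    Returns a tuple (ok, detail). `host` defaults to this node's hostname.
    Hypothetical helper, not part of MPICH2.
    """
    if host is None:
        host = socket.gethostname()

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Bind to all interfaces and let the OS pick a free port.
        listener.bind(("", 0))
        listener.listen(1)
        port = listener.getsockname()[1]

        # Step 1: does the name resolve at all?
        try:
            addr = socket.gethostbyname(host)
        except socket.error as exc:
            return False, "cannot resolve %r: %s" % (host, exc)

        # Step 2: can we reach our own listener through that address?
        try:
            client = socket.create_connection((addr, port), timeout=timeout)
        except socket.error as exc:  # also covers timeouts
            return False, "connect to %s:%d failed: %s" % (addr, port, exc)

        client.close()
        return True, "connected to %s:%d" % (addr, port)
    finally:
        listener.close()
```

Calling print(can_connect_to_self()) on node-05-02 and on 06-01 and comparing
the two results may show whether the hostname on 05-02 resolves to an address
that the node itself cannot reach, which would be consistent with the
"Connection timed out" lines in the error message.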
The nodes are AMD Opterons (05-02 has 2 dual-core CPUs, while 06-01 and
06-02 have 4 dual-core CPUs). The OS is Debian Linux. I should also
say that there are some differences in the kernel between the 05-02 node
and the 06 nodes.
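Since the nodes differ in kernel, a small script could collect the same few
facts on each node so that they can be compared side by side. This is only a
sketch: node_summary and the fields it reports are chosen for illustration and
are not provided by MPICH2 or AMBER.

```python
import platform
import socket


def node_summary():
    """Collect a few facts about this node for comparison across nodes.

    Hypothetical helper; the selection of fields is an assumption about
    what might be relevant here.
    """
    host = socket.gethostname()
    try:
        # The address that the node's own hostname resolves to.
        addr = socket.gethostbyname(host)
    except socket.error as exc:
        addr = "unresolved (%s)" % exc

    return {
        "hostname": host,
        "resolved_address": addr,
        "kernel_release": platform.release(),
        "machine": platform.machine(),
        "python_version": platform.python_version(),
    }
```

Printing the result of node_summary() on 05-02 and on one of the 06 nodes
gives two small dictionaries that can be compared directly. A
resolved_address on 05-02 that does not belong to one of its own interfaces
would be worth a closer look.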
Has anybody seen such behavior before? If so, and you need more details,
please let me know which ones and I will provide them.
Best wishes
vlad
--full error message from mpdtrace -----
mpdtrace (mpdtrace 57): got eof on console
node-05-02_59965 (mpd_sockpair 226): connect 110 Connection timed out
node-05-02_59965 (mpd_sockpair 233): connect error with 110 Connection timed out
node-05-02_59965 (mpd_sockpair 244): connect 22 Invalid argument
node-05-02_59965: mpd_uncaught_except_tb handling:
  socket.error: (22, 'Invalid argument')
    /scratch/node-06-01/cojocavd/Software/mpich2-1.0.7-install/bin/mpdlib.py  245  mpd_sockpair
        raise socket.error, errinfo
    /scratch/node-06-01/cojocavd/Software/mpich2-1.0.7-install/bin/mpdlib.py  802  create_single_mem_ring
        self.lhsSock,self.rhsSock = mpd_sockpair()
    /scratch/node-06-01/cojocavd/Software/mpich2-1.0.7-install/bin/mpdlib.py  848  enter_ring
        rhsHandler=rhsHandler)
    /scratch/node-06-01/cojocavd/Software/mpich2/bin/mpd  250  run
        rhsHandler=self.handle_rhs_input)
    /scratch/node-06-01/cojocavd/Software/mpich2/bin/mpd  1492  ?
        mpd.run()
--
----------------------------------------------------------------------------
Dr. Vlad Cojocaru
EML Research gGmbH
Schloss-Wolfsbrunnenweg 33
69118 Heidelberg
Tel: ++49-6221-533266
Fax: ++49-6221-533298
e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de
http://projects.villa-bosch.de/mcm/people/cojocaru/
----------------------------------------------------------------------------
EML Research gGmbH
Amtgericht Mannheim / HRB 337446
Managing Partner: Dr. h.c. Klaus Tschira
Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter
http://www.eml-r.org
----------------------------------------------------------------------------
Received on Sun Jul 20 2008 - 06:07:25 PDT