AMBER: problem running parallel jobs

From: Nikola Trbovic <nt2146.columbia.edu>
Date: Tue, 05 Jun 2007 16:43:02 -0400

Dear all,

I'm having problems running pmemd and sander with MPI on more than 2
nodes over Gigabit Ethernet. Shortly after the job starts, one of the
nodes (which one appears to be random) reports a network error from the
tg3 driver:

tg3: eth0: transmit timed out, resetting
tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
...

This node then disappears from the network for a couple of minutes and
the job stalls, although it doesn't terminate.
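
In case it helps anyone check their own nodes for the same symptom, here
is a minimal sketch in Python that counts these two messages in kernel
log text. The function name count_tg3_errors and the sample text are made
up for illustration; only the two message formats are taken from the
output quoted above, and real logs may contain other tg3 messages that
this does not cover.

```python
import re

# Matches only the two tg3 kernel messages quoted in this post.
# Other tg3 message formats are not handled (assumption: these two
# are enough to spot the symptom).
TG3_ERROR = re.compile(
    r"tg3: (?:(?P<iface>\w+): transmit timed out, resetting"
    r"|tg3_stop_block timed out, ofs=(?P<ofs>[0-9a-fA-F]+) enable_bit=(?P<bit>\d+))"
)


def count_tg3_errors(log_text):
    """Count transmit timeouts and stop_block timeouts in log text."""
    counts = {"transmit_timeout": 0, "stop_block_timeout": 0}
    for line in log_text.splitlines():
        match = TG3_ERROR.search(line)
        if match is None:
            continue  # not one of the two tg3 messages
        if match.group("iface") is not None:
            counts["transmit_timeout"] += 1
        else:
            counts["stop_block_timeout"] += 1
    return counts


# Hypothetical sample: the first two lines are the messages from this
# post, the third is a placeholder for unrelated log content.
sample = """\
tg3: eth0: transmit timed out, resetting
tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
placeholder line for unrelated log content
"""

print(count_tg3_errors(sample))
```

Running this against the saved kernel log of each node would show which
nodes logged the timeouts and how often.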

Running 4 processes on one node, or even 8 on two nodes, works fine,
however. I've tried both mpich2 and mpich, with and without fftw; it
made no difference. I'm compiling pmemd with ifort on RHEL 4. I know
this all indicates that the problem lies not with AMBER but with my OS
or the tg3 driver. Still, I was wondering whether anybody has
experienced the same thing and could advise on how to fix it.

Thanks a lot in advance,
Nikola Trbovic

-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
Received on Wed Jun 06 2007 - 06:07:36 PDT