Re: AMBER: problem running parallel jobs

From: Robert Konecny <rok.ucsd.edu>
Date: Tue, 5 Jun 2007 14:34:44 -0700

Hi Nikola,

try disabling TCP segmentation offload (TSO) on your eth0:

/usr/sbin/ethtool -K eth0 tso off

Some versions of the tg3 driver choke on heavier traffic with segmentation offload enabled.
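
If you want to check the current setting first and only change it where needed, something along these lines might help. This is a rough sketch, not a tested recipe: the helper names tso_state and disable_tso are made up for illustration, and it assumes that `ethtool -k` prints a line reading either "tcp segmentation offload: on" or "tcp-segmentation-offload: on" (the wording differs between ethtool versions).

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of ethtool.

# Read `ethtool -k <iface>` output on stdin and print the TSO state
# ("on" or "off"). Accepts both the older "tcp segmentation offload:"
# and the newer "tcp-segmentation-offload:" spelling (assumed formats).
tso_state() {
    sed -n 's/^[[:space:]]*tcp[- ]segmentation[- ]offload:[[:space:]]*\([a-z]*\).*/\1/p' | head -n 1
}

# Turn TSO off on the given interface, but only if it is currently on.
# Must be run as root; the ethtool path is the one used in this thread.
disable_tso() {
    iface=$1
    state=$(/usr/sbin/ethtool -k "$iface" | tso_state)
    if [ "$state" = "on" ]; then
        /usr/sbin/ethtool -K "$iface" tso off
    fi
}
```

After sourcing this, `disable_tso eth0` as root on each node would apply the change. As far as I know, a setting made with `ethtool -K` does not survive a reboot, so it would have to be reapplied from a startup script.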

robert



On Tue, Jun 05, 2007 at 04:43:02PM -0400, Nikola Trbovic wrote:
> Dear all,
>
> I'm having problems running pmemd and sander with mpi on more than 2
> nodes over gigabit ethernet. Shortly after starting the job, one of the
> nodes (which one is random) reports a network error associated with the
> tg3 driver:
>
> tg3: eth0: transmit timed out, resetting
> tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
> ...
>
> This node then disappears from the network for a couple of minutes and
> the job stalls, although it doesn't terminate.
>
> Running 4 processes on one node, or even 8 on two nodes works fine,
> however. I've tried using mpich2 and mpich, with fftw and without - it
> made no difference. I'm compiling pmemd with ifort on RHEL 4. I know
> this all indicates that it is not a problem with amber, but instead with
> my OS/tg3 driver. But I was wondering if anybody had experienced the
> same previously and could give advice on how to fix it.
>
> Thanks a lot in advance,
> Nikola Trbovic
>
> -----------------------------------------------------------------------
> The AMBER Mail Reflector
> To post, send mail to amber.scripps.edu
> To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
Received on Wed Jun 06 2007 - 06:07:37 PDT