RE: AMBER: problem running parallel jobs

From: Nikola Trbovic <>
Date: Wed, 6 Jun 2007 13:31:16 -0400

Thanks a lot Robert! That solved it! No more network errors!

Now I've performed a few pmemd benchmarks (explicit water, ~16000 atoms)
overnight and obtained terrible performance. First of all let me repeat
that I'm running on a gigabit cluster with four cores per node. Here are the
benchmark results:

Cores   Nodes   Time
    4       1   16113
    8       2   10251
   16       4   24143
   32       8   48138
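
For reference, the numbers in the table above can be converted into speedup
and parallel efficiency relative to the single-node run. The following is
only a minimal sketch: the function name scaling_table and the definition of
efficiency (speedup divided by the node ratio) are illustrative choices, and
the times are taken as reported, whatever their unit.

```python
# Times copied from the benchmark table above (unit as reported there).
benchmarks = [
    # (cores, nodes, time)
    (4, 1, 16113),
    (8, 2, 10251),
    (16, 4, 24143),
    (32, 8, 48138),
]


def scaling_table(rows):
    """Return (nodes, speedup, efficiency) tuples relative to the first row.

    speedup    = baseline time / time of this run
    efficiency = speedup / (nodes / baseline nodes)
    """
    base_nodes, base_time = rows[0][1], rows[0][2]
    result = []
    for _cores, nodes, time in rows:
        speedup = base_time / time
        efficiency = speedup / (nodes / base_nodes)
        result.append((nodes, speedup, efficiency))
    return result


if __name__ == "__main__":
    for nodes, speedup, efficiency in scaling_table(benchmarks):
        print(f"{nodes} node(s): speedup {speedup:.2f}x, "
              f"efficiency {efficiency:.0%}")
```

With these inputs the speedup comes out at about 1.57 on 2 nodes, and drops
below 1 (roughly 0.67 and 0.33) on 4 and 8 nodes.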

The odd thing is that while a 2-node job still achieves a 1.6-fold speedup
over a single-node job, a 4-node job achieves no speedup at all but instead
takes more than twice as long as a 2-node job, and an 8-node job more than
four times as long. So above 2 nodes the run time grows roughly linearly
with the number of nodes! I've read the recent note on pushing the limits
with gigabit and
multiple cores, but I haven't seen any benchmarks reporting such an extreme
drop in performance. I will run new benchmarks after increasing the network
buffers and checking my switch settings, but I still wanted to make sure
that this type of performance scaling is not perhaps indicative of remaining
problems with my network drivers, mpich2 installation or amber installation.
I am using NFS on the cluster, and the trajectories were being saved through
NFS on the head node. From the latest note on gigabit parallel computing it
sounds like that is a really bad idea. Could it explain the observed

Thanks again to Robert, and in advance for any thoughts about the scaling


-----Original Message-----
From: [] On Behalf Of
Robert Konecny
Sent: Tuesday, June 05, 2007 5:35 PM
Subject: Re: AMBER: problem running parallel jobs

Hi Nikola,

Try disabling TCP segmentation offload (TSO) on your eth0 interface:

/usr/sbin/ethtool -K eth0 tso off

Some versions of the tg3 driver choke on heavier traffic.


On Tue, Jun 05, 2007 at 04:43:02PM -0400, Nikola Trbovic wrote:
> Dear all,
> I'm having problems running pmemd and sander with mpi on more than 2
> nodes over gigabit ethernet. Shortly after starting the job, one of the
> nodes (which one is random) reports a network error associated with the
> tg3 driver:
> tg3: eth0: transmit timed out, resetting
> tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
> ...
> This node then disappears from the network for a couple of minutes and
> the job stalls, although it doesn't terminate.
> Running 4 processes on one node, or even 8 on two nodes works fine,
> however. I've tried using mpich2 and mpich, with fftw and without - it
> made no difference. I'm compiling pmemd with ifort on RHEL 4. I know
> this all indicates that it is not a problem with amber, but instead with
> my OS/tg3 driver. But I was wondering if anybody had experienced the
> same previously and could give advice on how to fix it.
> Thanks a lot in advance,
> Nikola Trbovic
> -----------------------------------------------------------------------
> The AMBER Mail Reflector
> To post, send mail to
> To unsubscribe, send "unsubscribe amber" to

Received on Sun Jun 10 2007 - 06:07:13 PDT