RE: AMBER: problem running parallel jobs from Nikola Trbovic on 2007-06-06 (Amber Archive Jun 2007)

From: Nikola Trbovic <nt2146.columbia.edu>
Date: Wed, 6 Jun 2007 13:31:16 -0400

Thanks a lot Robert! That solved it! No more network errors!

Now I've run a few pmemd benchmarks (explicit water, ~16000 atoms)
overnight and obtained terrible performance. First of all, let me repeat
that I'm running on a gigabit Ethernet cluster with four cores per node.
Here are the benchmark results:

Cores  Nodes   Time
    4      1  16113
    8      2  10251
   16      4  24143
   32      8  48138

The odd thing is that while a 2-node job still achieves a 1.6-fold speedup
over a single-node job, a 4-node job achieves no speedup at all but instead
takes more than twice as long as a 2-node job, and an 8-node job more than
four times as long. So above 2 nodes the run time grows roughly linearly
with the number of nodes, i.e. performance drops in proportion to the node
count! I've read the recent note on pushing the limits with gigabit and
multiple cores, but I haven't seen any benchmarks reporting such an extreme
drop in performance. I will run new benchmarks after increasing the network
buffers and checking my switch settings, but I still wanted to make sure
that this type of performance scaling is not perhaps indicative of remaining
problems with my network drivers, mpich2 installation or amber installation.
I am using NFS on the cluster, and the trajectories were being saved through
NFS on the head node. From the latest note on gigabit parallel computing it
sounds like that is a really bad idea. Could it explain the observed
scaling?
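
For reference, the speedup and parallel efficiency implied by the table
above can be worked out with a short script. This is only a sketch: the
numbers are copied from the table, the time unit is not stated so only
ratios are computed, and the function name scaling_table is a made-up
helper, not part of Amber or pmemd.

```python
# Benchmark figures copied from the table above: (cores, nodes, time).
# The time unit is not given in the post, so only ratios are meaningful here.
benchmarks = [
    (4, 1, 16113),
    (8, 2, 10251),
    (16, 4, 24143),
    (32, 8, 48138),
]


def scaling_table(rows):
    """Return (nodes, speedup, efficiency) tuples relative to the first row.

    speedup    = baseline time / time of this run
    efficiency = speedup / (nodes / baseline nodes)
    """
    base_nodes, base_time = rows[0][1], rows[0][2]
    result = []
    for _cores, nodes, time in rows:
        speedup = base_time / time
        efficiency = speedup / (nodes / base_nodes)
        result.append((nodes, speedup, efficiency))
    return result


for nodes, speedup, efficiency in scaling_table(benchmarks):
    print(f"{nodes} node(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

A speedup below 1.0 means the run is slower than the single-node baseline,
which is the case for the 4-node and 8-node jobs.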

Thanks again to Robert, and in advance for any thoughts about the scaling
issue,

Nikola

-----Original Message-----
From: owner-amber.scripps.edu [mailto:owner-amber.scripps.edu] On Behalf Of
Robert Konecny
Sent: Tuesday, June 05, 2007 5:35 PM
To: amber.scripps.edu
Subject: Re: AMBER: problem running parallel jobs

Hi Nikola,

Try disabling TCP segmentation offload (TSO) on your eth0:

/usr/sbin/ethtool -K eth0 tso off

Some versions of the tg3 driver choke on heavier traffic.

robert

On Tue, Jun 05, 2007 at 04:43:02PM -0400, Nikola Trbovic wrote:
> Dear all,
>
> I'm having problems running pmemd and sander with mpi on more than 2
> nodes over gigabit ethernet. Shortly after starting the job, one of the
> nodes (which one is random) reports a network error associated with the
> tg3 driver:
>
> tg3: eth0: transmit timed out, resetting
> tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
> ...
>
> This node then disappears from the network for a couple of minutes and
> the job stalls, although it doesn't terminate.
>
> Running 4 processes on one node, or even 8 on two nodes works fine,
> however. I've tried using mpich2 and mpich, with fftw and without - it
> made no difference. I'm compiling pmemd with ifort on RHEL 4. I know
> this all indicates that it is not a problem with amber, but instead with
> my OS/tg3 driver. But I was wondering if anybody had experienced the
> same previously and could give advice on how to fix it.
>
> Thanks a lot in advance,
> Nikola Trbovic
>
> -----------------------------------------------------------------------
> The AMBER Mail Reflector
> To post, send mail to amber.scripps.edu
> To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu

Received on Sun Jun 10 2007 - 06:07:13 PDT