RE: AMBER: problem running parallel jobs

From: Andy Purkiss-Trew <>
Date: Thu, 07 Jun 2007 10:53:34 +0100

One other point, which I haven't seen mentioned in the other
responses, is the number of atoms per node. At 2 nodes you have ~2000
atoms per processor; at 8 nodes you'll have only ~500. There will always
be a limit to the speedup as the number of atoms per node gets smaller.

Try benchmarking the speedup with a system ten times the size and you
should find better speedups with 4/8 nodes, as a greater proportion of
the time is spent number crunching compared to communicating.
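As a rough illustration of why this happens, Amdahl's law gives the maximum speedup when some fraction of the runtime (communication and other serial overhead) doesn't parallelise. The 5% overhead figure below is an assumption for illustration only, not anything measured on your cluster:

```python
def amdahl_speedup(serial_fraction, n_procs):
    """Maximum speedup on n_procs processors when serial_fraction
    of the runtime cannot be parallelised (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)

# Assumed 5% non-parallel overhead -- purely illustrative.
for p in (4, 8, 16, 32):
    print(f"{p:2d} procs: at most {amdahl_speedup(0.05, p):.2f}x")
```

With a bigger system the effective serial fraction shrinks, which is why the ten-times-larger benchmark should scale further.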

I've seen similar problems with sander on a Cray (some time ago now)
when running on 8 to 128 processors! There will always be a best
number of processors for a given job, and this will vary with the
number of atoms and the type of simulation being run.

I would suggest that for your system you will get better results
by running on fewer nodes but running more simulations at the same time,
i.e. run four jobs simultaneously on 2 nodes each, rather than one job
on 8.
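Incidentally, on the TSO workaround Robert mentioned further down the thread: it's worth checking the current offload state first, and remembering that the setting doesn't persist across reboots (a sketch; eth0 is assumed to be the relevant interface):

```shell
# Show current offload settings for eth0 (interface name assumed)
/usr/sbin/ethtool -k eth0 | grep tcp-segmentation-offload

# Disable TCP segmentation offload, as Robert suggested
/usr/sbin/ethtool -K eth0 tso off

# This does not survive a reboot; re-apply it from a startup
# script (e.g. /etc/rc.local on RHEL 4) if it turns out to help.
```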

Hope this is of some help (and is still right for the current PMEMD,
which I've not yet used in anger on our cluster)
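For what it's worth, the benchmark times quoted below work out to these speedups relative to the single-node run (a quick sketch, taking the times verbatim from the message):

```python
# (cores, nodes, time) taken from the benchmarks quoted below
runs = [(4, 1, 16113), (8, 2, 10251), (16, 4, 24143), (32, 8, 48138)]

base_time = runs[0][2]  # single-node (4-core) run as the baseline
for cores, nodes, t in runs:
    speedup = base_time / t
    print(f"{cores:2d} cores / {nodes} node(s): speedup {speedup:.2f}x")
```

Anything below 1.00x is slower than the single-node run, which makes the collapse above 2 nodes easy to see at a glance.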

On Wed, 2007-06-06 at 13:31 -0400, Nikola Trbovic wrote:
> Thanks a lot Robert! That solved it! No more network errors!
> Now I've performed a few pmemd benchmarks (explicit water, ~16000 atoms)
> over night and obtained terrible performance. First of all let me repeat
> that I'm running on a gigabit cluster with four cores per node. Here are the
> benchmark results:
> Cores  Nodes   Time
>     4      1  16113
>     8      2  10251
>    16      4  24143
>    32      8  48138
> The odd thing is that while a 2-node job still achieves a 1.6-fold speedup
> over a single-node job, a 4-node job achieves no speedup at all but instead
> takes more than twice as long as a 2-node job, and an 8-node job four times
> as long. So above 2 nodes the runtime scales linearly with the number of
> nodes! I've read the recent note on pushing the limits with gigabit and
> multiple cores, but I haven't seen any benchmarks reporting such an extreme
> drop in performance. I will run new benchmarks after increasing the network
> buffers and checking my switch settings, but I still wanted to make sure
> that this type of performance scaling is not perhaps indicative of remaining
> problems with my network drivers, mpich2 installation or amber installation.
> I am using NFS on the cluster, and the trajectories were being saved through
> NFS on the head node. From the latest note on gigabit parallel computing it
> sounds like that is a really bad idea. Could it explain the observed
> scaling?
> Thanks again to Robert, and in advance for any thoughts about the scaling
> issue,
> Nikola
> -----Original Message-----
> From: [] On Behalf Of
> Robert Konecny
> Sent: Tuesday, June 05, 2007 5:35 PM
> To:
> Subject: Re: AMBER: problem running parallel jobs
> Hi Nikola,
> try to disable the tcp segmentation offload on your eth0:
> /usr/sbin/ethtool -K eth0 tso off
> some versions of the tg3 driver choke on heavier traffic.
> robert
> On Tue, Jun 05, 2007 at 04:43:02PM -0400, Nikola Trbovic wrote:
> > Dear all,
> >
> > I'm having problems running pmemd and sander with mpi on more than 2
> > nodes over gigabit ethernet. Shortly after starting the job, one of the
> > nodes (which one is random) reports a network error associated with the
> > tg3 driver:
> >
> > tg3: eth0: transmit timed out, resetting
> > tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
> > ...
> >
> > This node then disappears from the network for a couple of minutes and
> > the job stalls, although it doesn't terminate.
> >
> > Running 4 processes on one node, or even 8 on two nodes works fine,
> > however. I've tried using mpich2 and mpich, with fftw and without - it
> > made no difference. I'm compiling pmemd with ifort on RHEL 4. I know
> > this all indicates that it is not a problem with amber, but instead with
> > my OS/tg3 driver. But I was wondering if anybody had experienced the
> > same previously and could give advice on how to fix it.
> >
> > Thanks a lot in advance,
> > Nikola Trbovic
> >
> > -----------------------------------------------------------------------
> > The AMBER Mail Reflector
> > To post, send mail to
> > To unsubscribe, send "unsubscribe amber" to
Cat, n.: Lapwarmer with built-in buzzer.
| Andy Purkiss-Trew, School of Crystallography, Birkbeck College, London |
|           E-mail                    |
Received on Sun Jun 10 2007 - 06:07:20 PDT