Re: AMBER: PMEMD Performance on Beowulf systems

From: Carlos Simmerling <carlos.csb.sunysb.edu>
Date: Fri, 19 Dec 2003 09:53:12 -0500

We had gigabit networking on both our dual Athlons (1.6 GHz)
and our dual Xeons. Scaling was much worse on the Athlons
until we found that moving the (Intel) network cards to a
different slot made a huge difference on the Athlon motherboards.
You should check this to see what the PCI bandwidth is on each
slot; for us, the slots were not the same.
Carlos
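
[A crude way to compare the effective bandwidth the card actually delivers
in each slot (an illustrative sketch, not part of Carlos's setup; the port
number and script name are arbitrary placeholders) is to push a large
buffer between two nodes over a plain TCP socket, time it, and repeat with
the NIC moved to the other slot:

    # tcp_bw.py - rough point-to-point TCP bandwidth check (illustrative).
    # Run "python tcp_bw.py server" on one node, then
    # "python tcp_bw.py client <server-host>" on the other; repeat with
    # the NIC in the other PCI slot and compare the reported MB/s.
    import socket, sys, time

    PORT = 5001              # arbitrary test port (placeholder)
    CHUNK = 1 << 20          # 1 MiB per send
    TOTAL = 256 * CHUNK      # 256 MiB per run

    def server():
        s = socket.socket()
        s.bind(("", PORT))
        s.listen(1)
        conn, _ = s.accept()
        received = 0
        while received < TOTAL:
            data = conn.recv(CHUNK)
            if not data:
                break
            received += len(data)
        conn.close()

    def client(host):
        s = socket.create_connection((host, PORT))
        buf = b"x" * CHUNK
        start = time.time()
        sent = 0
        while sent < TOTAL:
            s.sendall(buf)
            sent += len(buf)
        s.close()
        elapsed = time.time() - start
        # Kernel buffering makes this approximate, but it is enough to
        # show a slot-to-slot difference of the kind Carlos describes.
        print("%.1f MB/s" % (sent / elapsed / 1e6))

    if __name__ == "__main__":
        if sys.argv[1] == "server":
            server()
        else:
            client(sys.argv[2])
]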

----- Original Message -----
From: "Robert Duke" <rduke.email.unc.edu>
To: <amber.scripps.edu>
Sent: Thursday, December 18, 2003 11:35 PM
Subject: Re: AMBER: PMEMD Performance on Beowulf systems


> Stephen -
> Several points -
> 1) Gigabit ethernet is not particularly good for scaling. The numbers I
> published were on IBM blade clusters that had no other load on them, and
> the gigabit interconnect was isolated from other net traffic. If you split
> across switches or have other things going on (i.e., other jobs running
> anywhere on machines on the interconnect), performance tends to really
> drop. This is all you can expect from such a slow interconnect. A real
> killer for dual Athlons is to not take advantage of the dual processors;
> typically, if you have gigabit ethernet you will get better performance
> through shared memory (see the sketch after these points), and if one of
> the CPUs is being used for something else, you can't do this.
> 2) LAM MPI in my hands is slower than MPICH, around 10% if I recollect,
> without extensive testing (i.e., I probably only did the check on some
> Athlons with a slow interconnect, but inferred that LAM was not
> necessarily an improvement). Taking this into account, your Xeon numbers
> are really not very different from mine (you are roughly 10% better at
> 8 CPUs and 20% worse at 16 CPUs).
> 3) Our 1.6 GHz Athlons are slower than our 2.4 GHz Xeons. I like the
> Athlons, but the Xeons can take advantage of vectorizing SSE2
> instructions. I don't know what your Athlons are, but I am not surprised
> they are slower. As for why they are scaling so badly, I would suspect
> loading, config, net cards, motherboards, or heaven only knows. Lots of
> things can be slow (back to item 1).
> 4) I don't use the Portland Group compilers at all because I had problems
> with them a couple of years ago, and the company did absolutely nothing
> to help. It looked like floating point register issues. This is probably
> no longer the case, but the point is that I don't know what performance
> one would expect. My numbers are from the Intel Fortran compiler. There
> could also be issues with how LAM was built, or MPICH if you change to
> that.
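
[As a rough illustration of the shared-memory remark in point 1 (a generic
sketch, not one of Bob's benchmarks): gigabit ethernet tops out at roughly
125 MB/s in theory, while two processes on the same node can usually move
data through local IPC several times faster, which is why keeping both
CPUs of a dual node available to the MPI job matters:

    # pipe_bw.py - rough intra-node IPC bandwidth check (illustrative).
    # Gives a feel for what two ranks on the same dual node can exchange
    # locally, against the ~125 MB/s theoretical ceiling of gigabit
    # ethernet.
    import time
    from multiprocessing import Pipe, Process

    CHUNK = 1 << 20          # 1 MiB messages
    COUNT = 256              # 256 MiB total

    def sender(conn):
        buf = b"x" * CHUNK
        for _ in range(COUNT):
            conn.send_bytes(buf)
        conn.close()

    if __name__ == "__main__":
        parent, child = Pipe()
        p = Process(target=sender, args=(child,))
        p.start()
        start = time.time()
        received = 0
        for _ in range(COUNT):
            received += len(parent.recv_bytes())
        elapsed = time.time() - start
        p.join()
        print("local IPC: %.1f MB/s (gigabit ceiling is ~125 MB/s)"
              % (received / elapsed / 1e6))
]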
>
> You have to really bear in mind that with gigabit ethernet, you are at the
> absolute bottom of reasonable interconnects for this type of system, and
> it does not take much at all for numbers to be twofold worse than the ones
> I published. My numbers are for isolated systems, good hardware, with the
> MPI build carefully checked out, and with PMEMD built with ifc, which is
> also well checked out.
>
> Regards - Bob Duke
>
> ----- Original Message -----
> From: <Stephen.Titmuss.csiro.au>
> To: <amber.scripps.edu>
> Sent: Thursday, December 18, 2003 10:19 PM
> Subject: AMBER: PMEMD Performance on Beowulf systems
>
>
> > Hello All,
> >
> > We have been testing PMEMD 3.1 on a 32 cpu (16x dual Athlon nodes)
> > cluster with a gigabit switch. The performance we have been seeing (in
> > terms of scaling to larger numbers of CPUs) is a bit disappointing when
> > compared to the figures released for PMEMD. For example, comparing
> > ps/day rates for the JAC benchmark (with the specified cutoff changes,
> > etc.) on our cluster (left column) with those presented for a 2.4 GHz
> > Xeon cluster, also with a gigabit switch (right column), gives:
> >
> >            Athlon    Xeon
> >   1 cpu:      108       -
> >   2 cpu:      172     234
> >   4 cpu:      239     408
> >   8 cpu:      360     771
> >  16 cpu:      419    1005
> >  32 cpu:      417       -
> >
> > In general, in terms of wall-clock time, we only see a parallel speedup
> > (relative to 1 CPU) of about 3.3 at 8 CPUs and struggle to get much past
> > 3.9 going to higher numbers of CPUs. The parallel scaling presented for
> > other cluster machines appears to be much better. Has anyone else
> > achieved good parallel speedup on Beowulf systems?
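
[The speedup figures quoted here follow directly from the Athlon column of
the ps/day table above; a short illustrative script that reproduces them,
using only the numbers already posted:

    # speedup.py - speedup and efficiency from the Athlon ps/day figures
    # in the table above.
    athlon = {1: 108, 2: 172, 4: 239, 8: 360, 16: 419, 32: 417}
    base = athlon[1]
    for ncpu in sorted(athlon):
        speedup = athlon[ncpu] / float(base)
        print("%2d cpu: %4d ps/day  speedup %.2f  efficiency %3.0f%%"
              % (ncpu, athlon[ncpu], speedup, 100.0 * speedup / ncpu))
    # 8 CPUs gives 360/108 = 3.33x; 16 and 32 CPUs stall near 3.9x,
    # matching the speedups quoted above.
]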
> >
> > Also, we are using the Portland f90 compiler and LAM in our setup - has
> > anyone experienced problems with this compiler or MPI library with
> > PMEMD?
> >
> > Thanks in advance,
> >
> > Stephen Titmuss
> >
> > CSIRO Health Sciences and Nutrition
> > 343 Royal Parade
> > Parkville, Vic. 3052
> > AUSTRALIA
> >
> > Tel: +61 3 9662 7289
> > Fax: +61 3 9662 7347
> > Email: stephen.titmuss.csiro.au
> > www.csiro.au www.hsn.csiro.au
> >


-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu