Hi Servaas,
The problems with using gigabit ethernet (or any ethernet for that matter)
as an interconnect for clusters have been discussed on this list a number of
times but are worth repeating since it may benefit a number of people
struggling with such setups.
The main issue is that, unlike "real" interconnects such as InfiniBand,
Myrinet, Quadrics etc., ethernet was never designed to be abused in the
way that something like Amber abuses it when running in parallel with MPI
traffic over it. Ethernet has an extremely poor mechanism for dealing with
packet loss due to congestion or lack of buffer space, and when it does get
overloaded it fails catastrophically, unlike interconnects that were
designed to be low-latency MPI carriers and degrade gracefully when
overloaded.
The issue, apart from the fact that TCP/IP has massive CPU overhead and
ethernet latency is awful, is that most of the ethernet cards and switches
that people buy for clusters are complete junk. In your case I suspect most
of the problems are a function of a cheap switch, cheap network cards, or
both. In terms of the switch, a number of problems exist, mainly because the
average gigabit switch is designed for routing web traffic and email, where
latency and flow control are not essential. To save money the switch
manufacturers install both a cheap backplane and small amounts of buffer
space. For example, it is typical for a 32-port gigabit switch to have only
a 10 gigabit backplane. Hence, while each port can theoretically do 1
gigabit to each other port, if they all try at once your real bandwidth
drops to about 0.312 gigabits per port, which in full duplex mode is only
around 150 Mbit/s in each direction. So if you insist on using gigabit
ethernet, make sure any switch you get is non-blocking across all ports.
Note this will probably add a zero to the end of the price.
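To make that arithmetic concrete, here is a small Python sketch of the
oversubscription calculation (the port count and backplane figure are just
the illustrative numbers from above, not a claim about any particular
switch):

```python
# Back-of-the-envelope check of switch oversubscription.
ports = 32               # gigabit ports on the switch
backplane_gbit = 10.0    # aggregate backplane capacity in Gbit/s

# If every port transmits at once, the backplane is shared across all ports.
per_port_gbit = backplane_gbit / ports
print(f"per-port share: {per_port_gbit:.3f} Gbit/s")       # ~0.312 Gbit/s

# Full duplex needs bandwidth in both directions, halving the rate per
# direction.
per_direction_mbit = per_port_gbit / 2 * 1000
print(f"per direction:  {per_direction_mbit:.0f} Mbit/s")  # ~156 Mbit/s
```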
The second problem is that by default most switches either do not support
flow control or have it turned off. What this means is that if the switch
runs out of buffer space from being overloaded, as Amber will easily make it
do, you get massive packet loss rather than throttling. Each lost packet
costs you between 500 and 2000 ms in retransmission delay, which as you can
imagine instantly destroys any parallel scaling. You can try turning on flow
control on the switch if it supports it, although the ethernet cards need to
support it as well, so you will likely need server-quality cards that do
proper offloading of CPU work. I suspect the fact that most cards do not
support it is why most switches that do support flow control have it turned
off by default.
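As a sketch of how you might check and enable this on a Linux node - this
assumes ethtool is installed, requires root, and "eth0" is a placeholder for
your actual MPI interface:

```python
# Query and (optionally) enable ethernet flow control (802.3x pause frames).
# Assumption: Linux with ethtool installed; run as root. "eth0" is a
# placeholder interface name.
import subprocess

IFACE = "eth0"

# Show the current pause-frame settings for the interface.
subprocess.run(["ethtool", "-a", IFACE], check=True)

# Try to enable RX and TX flow control. Both the NIC and the switch must
# support it for this to have any effect.
subprocess.run(["ethtool", "-A", IFACE, "rx", "on", "tx", "on"], check=True)
```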
You can check whether it is a switch-only problem by getting a crossover
cable, hooking two machines up directly, and seeing what happens. If things
get better, the switch is causing the problems; if they don't, it could be
the cards, or both.
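If you want numbers rather than a feel, a crude probe like the Python sketch
below measures the average TCP round trip between two boxes; run it once
through the switch and once over the crossover cable and compare. It is only
a rough sketch - a proper tool such as NetPIPE will do a much better job:

```python
# Crude TCP ping-pong round-trip probe. Start "python probe.py server" on
# one box, then "python probe.py client <server-ip>" on the other.
import socket
import sys
import time

PORT, REPS, MSG = 5001, 1000, b"x" * 64

def recv_all(sock, n):
    # Receive exactly n bytes (TCP may deliver a message in pieces).
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed connection")
        buf += chunk
    return buf

if sys.argv[1] == "server":
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    for _ in range(REPS):                 # echo every message straight back
        conn.sendall(recv_all(conn, len(MSG)))
else:
    cli = socket.socket()
    cli.connect((sys.argv[2], PORT))
    cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    t0 = time.time()
    for _ in range(REPS):
        cli.sendall(MSG)
        recv_all(cli, len(MSG))
    rtt_us = (time.time() - t0) / REPS * 1e6
    print(f"average round trip: {rtt_us:.0f} microseconds")
```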
Even with a decent set of cards, a decent switch, and a completely separate
network for MPI and NFS traffic, I wouldn't expect much out of ethernet. The
issue is many-fold. Firstly, Amber itself has improved tremendously in
performance over the years; see the Amber website for a great example by
Dave Case. Thus the communication required per second for a given processor
has increased with each new version of Amber. Next, CPUs are significantly
faster today and there are more of them in a box than there used to be. If
you used to have a single 1 GHz chip in a box with a single gigabit card,
your communication bandwidth was 1 gigabit per GHz. Now people have
dual-socket, dual-core 3 GHz chips but still put just one gigabit ethernet
card in a box, so your bandwidth is now around 80 Mbit/s per GHz. Quite a
difference. The real issue, though, is that with single 1 GHz chips the
computation speed was such that the interconnect was never maxed out, the
switch never ran out of buffer space, and everything worked great. Now a
single box easily maxes out the switch and causes a catastrophic failure due
to packet loss, so things look much worse than the 80 Mbit/s per GHz figure
would suggest.
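The same comparison in a couple of lines of Python (the machine
configurations are the illustrative ones from the paragraph above):

```python
# Communication bandwidth per GHz of compute, for one NIC per box.
def mbit_per_ghz(nic_gbit, sockets, cores_per_socket, ghz):
    total_ghz = sockets * cores_per_socket * ghz
    return nic_gbit * 1000 / total_ghz

print(mbit_per_ghz(1.0, 1, 1, 1.0))  # old box: 1000 Mbit/s per GHz
print(mbit_per_ghz(1.0, 2, 2, 3.0))  # new box: ~83 Mbit/s per GHz
```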
Short answer:
1) Try flow control if your hardware supports it.
2) Try putting an extra server-quality NIC in each box and hooking them up
with crossover cables - this will at least (hopefully) allow you to run
across two boxes. If you are really clever and put 2 extra NICs in each box,
you might be able to get to 3 boxes, although you will have to be clever in
how you configure the MPI stack (see the sketch after this list).
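As one illustration of the kind of MPI-stack cleverness involved, here is a
hypothetical sketch assuming Open MPI: it restricts the TCP transport to the
dedicated crossover NIC via the btl_tcp_if_include MCA parameter. The host
names, the interface name, and the pmemd path are all placeholders for your
own setup:

```python
# Hypothetical launcher: keep MPI traffic on the dedicated crossover NIC
# instead of the switched/NFS interface. Assumes Open MPI; host names,
# interface name, and binary path are placeholders.
import subprocess

hosts = ["node1", "node1", "node2", "node2"]  # two slots per dual-CPU box
mpi_iface = "eth1"                            # the extra point-to-point NIC

cmd = [
    "mpirun",
    "-np", "4",
    "--host", ",".join(hosts),
    "--mca", "btl", "tcp,self",                # use only the TCP transport
    "--mca", "btl_tcp_if_include", mpi_iface,  # and only this interface
    "./pmemd",
]
subprocess.run(cmd, check=True)
```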
Beyond this there is not much that you can do short of switching to a better
interconnect.
I'm not sure what you mean by "with Intel 100 Mbit ethernet there is normal
scaling". Are those much, much slower chips? Or perhaps just a better switch
with flow control?
Note the reason Turbomole works okay is that it does far more computation
per communication event. Hence it doesn't drive the switch as hard and
doesn't cause it to lose packets through overload.
I hope this helps. I am afraid there is no silver bullet here and the laws
of physics are against you...
Good luck,
Ross
/\
\/
|\oss Walker
| HPC Consultant and Staff Scientist |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |
Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.
_____
From: owner-amber.scripps.edu On Behalf Of
servaas michielssens
Sent: Tuesday, December 11, 2007 01:35
To: amber.scripps.edu
Subject: Re: AMBER: amber on AMD opteron-250
More info:
2 CPUs per node
gigabit ethernet network connection
little NFS traffic (but there is some)
Other programs run normally (e.g. Turbomole, which scales okay up to 16
CPUs), but GROMACS, for example, shows the same behaviour as Amber (and is
slower on AMD).
So my main problem is the jump when you take more than 4 CPUs: calculations
are faster on 4 CPUs than on 8. Scaling from 2 to 4 is okay, but more than
4 CPUs is the problem. Any suggestions there?
(With an Intel 100 Mbit ethernet network there is normal scaling.)
kind regards,
servaas
----- Original Message -----
From: Robert Duke <mailto:rduke.email.unc.edu>
To: amber.scripps.edu
Sent: Wednesday, December 05, 2007 10:31 PM
Subject: Re: AMBER: amber on AMD opteron-250
No, it should not be that bad, even for gigabit ethernet, presuming this is
a more-or-less standard PME run. If I run pmemd 8 on the JAC benchmark (PME,
NVE simulation, 500 steps, ~23K atoms) on my two Intel Xeon 3.2 GHz dual-CPU
workstations connected with a crossover cable, gigabit ethernet, and server
NICs, I get the following runtimes:
# procs   wallclock sec
1         186
2         113
4          64
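(As a quick sketch, the speedup and parallel efficiency those timings imply,
computed in Python:)

```python
# Speedup and efficiency relative to the single-processor run quoted above.
times = {1: 186, 2: 113, 4: 64}  # procs -> wallclock seconds
for p, t in times.items():
    speedup = times[1] / t
    print(f"{p} procs: speedup {speedup:.2f}x, efficiency {speedup / p:.0%}")
# 2 procs: ~1.65x (82%); 4 procs: ~2.91x overall, while 2 --> 4 alone is
# 113/64 ~ 1.77x, i.e. close to doubling.
```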
The 3.2 GHz Xeons and Opterons really have pretty similar performance. So if
you look at the 2 --> 4 processor performance, it comes pretty close to
doubling. The 1 --> 2 processor performance typically does not for small
dual-core nodes; this is typically a matter of shared cache and other
sharing effects, as well as the fact that there is a ton of overhead in the
parallelization code that has maximum impact and minimum benefit at 2 CPUs
(the single-CPU code has none of this - it is essentially a separate
implementation, optimized for the single processor). You don't show
single-processor performance at all, though. PMEMD 9 performance is even
better. So you have other things going on.
Regards - Bob
----- Original Message -----
From: David LeBard <mailto:david.lebard.asu.edu>
To: amber.scripps.edu
Sent: Wednesday, December 05, 2007 3:29 PM
Subject: Re: AMBER: amber on AMD opteron-250
Hi Servaas,
This is generally due to your network, which you did not mention, so I
assume we are talking about gigabit ethernet, and to the number of CPUs per
node, which you also neglected to specify. However, from my experience on
dual-CPU Opterons (240s and 248s) with gigabit ethernet, these numbers seem
about right. Unfortunately, for 20k atoms you will only get good scaling up
to 32 CPUs if you have a faster network like InfiniBand or Myrinet or the
like.
Good luck,
David LeBard
On 12/5/07, servaas michielssens <servaas.michielssens.student.kuleuven.be> wrote:
I ran a 20 ps simulation of a system of 20000 atoms on an AMD Opteron 250
cluster with 8 processors; I used Amber 8 and pmemd for the simulation. I
found some strange results:
proc   time (min)
2      31
3      29
4      20
5      23
6      24
7      20
8      21
4 processors gives the optimum, and it seems to be independent of how I
address the processors: for 5 processors, 1-2-3-4-5 or 1-2-3-4-7 gives the
same results; the optimum is always at four processors. Has anyone
experienced this scaling problem?
kind regards,
servaas michielssens
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
-----------------------------------------------------------------------