RE: AMBER: amber on AMD opteron-250

From: servaas michielssens <>
Date: Tue, 11 Dec 2007 21:27:20 +0100

On Tue, 2007-12-11 at 07:52 -0800, Ross Walker wrote:
> Hi Servaas,
> The problems with using gigabit ethernet (or any ethernet for that
> matter) as an interconnect for clusters have been discussed on this
> list a number of times but are worth repeating since it may benefit a
> number of people struggling with such setups.
> The main issue is that unlike "real" interconnects such as infiniband,
> myrinet, quadrics etc..., ethernet was never designed to be abused in
> the way that something like Amber does while running in parallel with
> MPI traffic over it. The issue comes about because ethernet has both
> an extremely bad system for dealing with packet loss due to
> congestion or lack of buffer space and when it does get overloaded it
> fails catastrophically, unlike interconnects that were designed to be
> low latency MPI carriers, which degrade gracefully when overloaded.
> The issue, apart from the fact that TCP/IP has massive cpu overhead
> and ethernet latency is awful, is the fact that most of the ethernet
> cards and switches that people buy for clusters are complete junk. For
> example in your case I suspect most of the problems are a function of
> a cheap switch, cheap network cards or both. In terms of the switch a
> number of problems exist that are mainly a function of the average
> gigabit switch being designed for routing web traffic and email where
> latency and flow control are not essential. To save money the switch
> manufacturers install both a cheap backplane and small amounts of
> buffer space. For example it is typical for a 32 port gigabit switch
> to have only a 10 gigabit backplane - hence while each port can
> theoretically do 1 gigabit to each other port if they all try it at
> once your real bandwidth drops to 0.312 gigabits per port which in
> full duplex mode only gets you around 150 Mbps. Hence you should make
> sure any gigabit switch you get, if you insist on using gigabit
> ethernet, is non-blocking across all ports. Note this will probably
> add a zero on the end of the price.
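
The backplane arithmetic above can be checked with a minimal Python sketch. The 32 ports and 10 gigabit backplane are the figures from the message; the function name and the assumption that the backplane is shared evenly between all ports are only illustrative, not a description of any particular switch.

```python
# Rough per-port bandwidth on an oversubscribed switch.
# Figures (32 ports, 10 gigabit backplane) are taken from the message above.
# Assumption: the backplane is shared evenly when every port transmits at once.

def per_port_gbps(backplane_gbps: float, ports: int, port_speed_gbps: float = 1.0) -> float:
    """Bandwidth available to each port, capped at the port's own line rate."""
    return min(port_speed_gbps, backplane_gbps / ports)


if __name__ == "__main__":
    share = per_port_gbps(10.0, 32)
    print(f"per port: {share:.4f} gigabit/s")
    # Splitting that share evenly between send and receive (full duplex):
    print(f"per direction: {share / 2 * 1000:.0f} Mbit/s")
```

Under this model, the cap at the line rate only takes effect once the backplane is large enough to serve every port at once.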
> The second problem is that by default most switches either do not
> support flow control or have it turned off. What this means is that if
> the switch runs out of buffer space due to being overloaded, as Amber
> will easily do, you get massive packet loss rather than throttling.
> Each lost packet costs you between 500 and 2000ms which as you can
> expect instantly destroys any parallel scaling. You can try turning on
> flow control on the switch if it supports it although the ethernet
> cards need to support it as well so you likely need server quality
> cards, that do proper offloading of cpu work. I suspect the fact that
> most cards do not support it is why most switches that support flow
> control have it turned off by default.
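
To put the 500 to 2000 ms figure in perspective, here is a rough Python sketch that expresses one lost packet as a number of MD steps. It borrows the JAC benchmark timing quoted further down in this thread (500 steps in 64 s on 4 processors). The function name is made up for illustration, and the model assumes every MPI process sits idle for the whole stall, which is a simplification.

```python
# Cost of one lost packet, measured in MD steps.
# Inputs from the thread: 500 steps in 64 s (JAC benchmark, 4 processors)
# and a stall of 500-2000 ms per lost packet.
# Assumption: all MPI processes wait for the full duration of the stall.

def steps_lost(stall_ms: float, run_seconds: float, n_steps: int) -> float:
    """How many steps of useful work fit into one stall."""
    ms_per_step = run_seconds * 1000.0 / n_steps
    return stall_ms / ms_per_step


if __name__ == "__main__":
    for stall_ms in (500.0, 2000.0):
        lost = steps_lost(stall_ms, run_seconds=64.0, n_steps=500)
        print(f"{stall_ms:.0f} ms stall ~ {lost:.1f} steps of compute")
```

So under these assumptions a single lost packet costs several whole steps, which is why even a small loss rate shows up in the wallclock time.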
> You can check if it is a switch only problem by getting a crossover
> cable and hooking two machines up together and see what happens. If
> things get better then it is your switch that is causing problems, if
> it doesn't then it could be the cards or both.
> Even with a decent set of cards, switch and a completely separate
> network for MPI and NFS traffic I wouldn't expect much out of
> ethernet. The issue is many fold. Firstly Amber itself has improved in
> performance tremendously over the years, just see the amber website
> for a great example by Dave Case. Thus the communications required per
> second for a given processor has increased with each new version of
> Amber. Next cpus are significantly faster today and there are more in
> a box than there used to be. I.e. if you used to have a single 1GHz
> chip in a box with a single gigabit card then your communication
> bandwidth was 1 gigabit per GHz. Now consider the fact that people
> have dual x dual core 3GHz chips. They still put just one gigabit
> ethernet card in a box so your bandwidth now is around 80 Mbps per GHz.
> Quite a difference. The real issue though is that at 1GHz single chips
> the computation speed was such that the interconnect was never maxed
> out, the switch never ran out of buffer space and everything worked
> great. Now a single box easily maxes out the switch and causes a
> catastrophic failure due to packet loss so things look much worse than
> the 80 Mbps per GHz would suggest.
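
The bandwidth-per-GHz comparison above works out as follows, as a small Python sketch. The node configurations are the ones described in the message (one 1 GHz chip versus dual sockets of dual-core 3 GHz chips, each with one gigabit card); the function and parameter names are illustrative only.

```python
# Network bandwidth available per GHz of compute in one box.
# Configurations are those described in the message above; names are illustrative.

def mbps_per_ghz(link_gbps: float, sockets: int, cores_per_socket: int, clock_ghz: float) -> float:
    """Link bandwidth in Mbit/s divided by the total clock rate of all cores."""
    total_ghz = sockets * cores_per_socket * clock_ghz
    return link_gbps * 1000.0 / total_ghz


if __name__ == "__main__":
    old_box = mbps_per_ghz(1.0, sockets=1, cores_per_socket=1, clock_ghz=1.0)
    new_box = mbps_per_ghz(1.0, sockets=2, cores_per_socket=2, clock_ghz=3.0)
    print(f"single 1 GHz chip:      {old_box:.0f} Mbit/s per GHz")
    print(f"dual x dual core 3 GHz: {new_box:.0f} Mbit/s per GHz")
```

Clock rate is only a crude stand-in for compute speed here, so this understates the gap if the newer chips also do more work per cycle.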
> Short answer:
> 1) Try flow control if your hardware supports it.
> 2) Try putting an extra server quality NIC in each box and hooking
> them up with cross over cables - this will at least (hopefully) allow
> you to run across two boxes. If you are really clever and put 2 extra
> NICs in each box then you might be able to get to 3 boxes although you
> will have to be clever in how you configure the MPI stack.
> Beyond this there is not much that you can do short of switching to a
> better interconnect.
> I'm not sure what you mean by "with Intel 100 Mbps there is normal
> scaling". Are these much, much slower chips? Or perhaps just a better
> switch with flow control.
I mean that we have another cluster with Intel processors, and there
is no problem there if I use more than 4 processors. The communication
there is over 100 Mbps ethernet.

> Note the reason Turbomol works okay is that it does far more
> computation per communication event. Hence it doesn't drive the switch
> so hard and doesn't cause it to lose packets due to overload.
> I hope this helps. I am afraid there is no silver bullet here and the
> laws of physics are against you...
Ok thanks, I will discuss your suggestions with the system administrator.
The university supercomputer also has AMD Opteron processors with a better
interconnect; I will try to run some tests there to compare.

thanks to all for the help


> Good luck,
> Ross
> /\
> \/
> |\oss Walker
> | HPC Consultant and Staff Scientist |
> | San Diego Supercomputer Center |
> | Tel: +1 858 822 0854 | EMail:- |
> | | PGP Key available on request |
> Note: Electronic Mail is not secure, has no guarantee of delivery, may
> not be read every day, and should not be used for urgent or sensitive
> issues.
> ______________________________________________________________
> From: []
> On Behalf Of servaas michielssens
> Sent: Tuesday, December 11, 2007 01:35
> To:
> Subject: Re: AMBER: amber on AMD opteron-250
> More info:
> 2cpu per node
> gigabit ethernet network connection
> little NFS traffic (but there is some)
> Other programs work normally (e.g. turbomol, which scales ok up to
> 16 cpus) but gromacs for example shows the same behaviour as
> amber (and is slower on AMD).
> So my main problem is the jump when you take more than 4
> cpu's: calculations are faster on 4 cpu's than on 8. Scaling from
> 2 to 4 is ok, but the main problem is more than 4 cpus. Any
> suggestions there?
> (With intel 100Mbit ethernet network there is a normal
> scaling)
> kind regards,
> servaas
> ----- Original Message -----
> From: Robert Duke
> To:
> Sent: Wednesday, December 05, 2007 10:31 PM
> Subject: Re: AMBER: amber on AMD opteron-250
> No, it should not be that bad, even for gigabit
> ethernet, presuming this is a more-or-less standard
> pme run. If I run pmemd 8, JAC benchmark (pme, nve
> simulation, 500 steps, ~23K atoms) on my two intel
> xeon 3.2 GHz dual cpu workstations connected with an
> XO cable, GB ethernet, server nics, I get the
> following runtimes:
> # procs    wallclock sec
>    1            186
>    2            113
>    4             64
> The 3.2 GHz xeons and opterons really have pretty
> similar performance.
> So if you look at the 2 --> 4 processor performance,
> it comes pretty close to doubling. The 1-->2
> processor performance typically does not for small
> dual core nodes; this is a matter typically of shared
> cache and other sharing effects, as well as the fact
> that there is a ton of overhead in the parallelization
> code that has maximum impact and minimum benefit at 2
> cpu's (and the single cpu code has none of this - it
> is essentially a separate implementation, optimized
> for the single processor). You don't show single
> processor performance at all though. PMEMD 9
> performance is even better. So you have other things
> going on.
> Regards - Bob
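
The scaling in the runtimes above can be quantified with a short Python sketch. The three timings are copied from the table in the message; the function names, and the definition of efficiency as speedup divided by the ideal speedup, are assumptions chosen for illustration.

```python
# Speedup and parallel efficiency from the JAC benchmark timings above.
# Wallclock seconds per processor count, copied from the message.
timings = {1: 186.0, 2: 113.0, 4: 64.0}


def speedup(timings: dict, baseline: int) -> dict:
    """Speedup of each run relative to the run on `baseline` processors."""
    t0 = timings[baseline]
    return {n: t0 / t for n, t in timings.items()}


def efficiency(timings: dict, baseline: int) -> dict:
    """Speedup divided by the ideal speedup n / baseline (1.0 is perfect scaling)."""
    return {n: s * baseline / n for n, s in speedup(timings, baseline).items()}


if __name__ == "__main__":
    s1, e1 = speedup(timings, 1), efficiency(timings, 1)
    for n in sorted(timings):
        print(f"{n} procs: speedup {s1[n]:.2f}, efficiency {e1[n]:.2f}")
    print(f"2 -> 4 procs: speedup {speedup(timings, 2)[4]:.2f}")
```

Changing the baseline to 2 isolates the 2 --> 4 step, which is the comparison the message draws attention to.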
> ----- Original Message -----
> From: David LeBard
> To:
> Sent: Wednesday, December 05, 2007 3:29 PM
> Subject: Re: AMBER: amber on AMD opteron-250
> Hi Servaas,
> This is generally due to your network, which
> you did not mention so I assume we are talking
> about the gigabit ethernet, and to the number
> of CPU's per node, which also you neglected to
> specify. However, with my experience on dual
> CPU opterons (240's and 248's) and a gigabit
> ethernet these numbers seem about right.
> Unfortunately you may only be able to get good
> scaling for 20k atoms up to 32 CPUs, but only
> if you have a faster network like infiniband
> or myrinet or the like.
> Good luck,
> David LeBard
> On 12/5/07, servaas michielssens
> < >
> wrote:
> I ran a 20ps simulation of a system of
> 20000 atoms on an AMD opteron 250
> cluster with 8 processors, I used
> amber8 and pmemd for the simulation. I
> found some strange results:
> proc    time (min)
>  2         31
>  3         29
>  4         20
>  5         23
>  6         24
>  7         20
>  8         21
> 4 processors gives the optimum, and it
> seems to be independent of how I
> address the processors. So for 5
> processors 1-2-3-4-5 or 1-2-3-4-7
> gives
> the same results; there is always an
> optimum at four processors.
> Has anyone
> experienced this scaling problem?
> kind regards,
> servaas michielssens
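
One way to read the timings in this original post is to look at the processor-minutes each run consumed, sketched here in Python. The times are copied from the table above; the variable names are illustrative, and treating processors x minutes as the cost of a run is an assumption made for the comparison.

```python
# Run time in minutes per processor count, copied from the original post.
minutes = {2: 31, 3: 29, 4: 20, 5: 23, 6: 24, 7: 20, 8: 21}

# Smallest processor count that reaches the shortest run time.
fastest = min(minutes, key=lambda n: (minutes[n], n))

# Processor-minutes used by each run. With perfect scaling this would stay flat.
cost = {n: n * t for n, t in minutes.items()}

if __name__ == "__main__":
    print("fastest run uses", fastest, "processors")
    for n in sorted(cost):
        print(f"{n} procs: {minutes[n]} min, {cost[n]} processor-minutes")
```

The cost column makes the problem visible: the run time stops improving after 4 processors while the resources consumed keep growing.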
> -----------------------------------------------------------------------
> The AMBER Mail Reflector
> To post, send mail to
> To unsubscribe, send "unsubscribe
> amber" to

Received on Wed Dec 12 2007 - 06:07:33 PST