Hi Servaas,
I just ran into a similar problem recently. I benchmarked JAC on our
university's Linux cluster (mixed Xeon and Opteron), which uses gigabit
NICs. For 2 ps NVE (I used the md input file from the benchmarks folder in
the installation directory) I got the following (I selected unused Xeon
3.0 GHz nodes only):
8 cpus on 4 nodes - 41:32 (min:sec)
4 cpus on 4 nodes - 38:40
4 cpus on 2 nodes - 36:49
2 cpus on 1 node - 57:16
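As a quick sanity check on those numbers, here is a small Python sketch (just arithmetic on the timings above, nothing newly measured) showing that going from 4 CPUs on 2 nodes to 8 CPUs on 4 nodes actually slows things down:

    # Convert the min:sec timings above to seconds and compare.
    runs = {
        "8 cpus / 4 nodes": (41, 32),
        "4 cpus / 4 nodes": (38, 40),
        "4 cpus / 2 nodes": (36, 49),
        "2 cpus / 1 node":  (57, 16),
    }
    base = 57 * 60 + 16          # the 2-cpu single-node run
    for label, (m, s) in runs.items():
        secs = m * 60 + s
        print(f"{label}: {secs:5d} s, speedup vs 2 cpus = {base / secs:.2f}")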
This was with sander.MPI and MPICH2 built with the Intel compilers. From this I
presumed that the gigabit network was holding things up. I emailed the
system admin to see if there was anything we could adjust to improve things
a little, but I haven't heard anything yet.
Cheers,
Chris Jones
Btw. Thanks to Ross, recompiling MPICH did the trick.
http://amber.ch.ic.ac.uk/archive/200710/0324.html
-----Original Message-----
From: owner-amber.scripps.edu [mailto:owner-amber.scripps.edu] On Behalf Of
servaas michielssens
Sent: Tuesday, December 11, 2007 2:27 PM
To: amber.scripps.edu
Subject: RE: AMBER: amber on AMD opteron-250
On Tue, 2007-12-11 at 07:52 -0800, Ross Walker wrote:
> Hi Servaas,
>
> The problems with using gigabit ethernet (or any ethernet for that
> matter) as an interconnect for clusters have been discussed on this
> list a number of times but are worth repeating since it may benefit a
> number of people struggling with such setups.
>
> The main issue is that, unlike "real" interconnects such as InfiniBand,
> Myrinet, Quadrics etc., ethernet was never designed to be abused the
> way something like Amber abuses it when running in parallel with MPI
> traffic over it. Ethernet handles packet loss from congestion or lack
> of buffer space very badly, and when it does get overloaded it fails
> catastrophically, unlike interconnects that were designed to be
> low-latency MPI carriers and which degrade gracefully when overloaded.
>
> The issue, apart from the fact that TCP/IP has massive CPU overhead
> and ethernet latency is awful, is that most of the ethernet cards and
> switches people buy for clusters are complete junk. In your case I
> suspect most of the problems are a function of a cheap switch, cheap
> network cards or both. On the switch side, a number of problems stem
> from the average gigabit switch being designed for routing web traffic
> and email, where latency and flow control are not essential. To save
> money the switch manufacturers install both a cheap backplane and a
> small amount of buffer space. For example, it is typical for a 32-port
> gigabit switch to have only a 10 gigabit backplane - hence while each
> port can theoretically do 1 gigabit to each other port, if they all
> try it at once your real bandwidth drops to 0.312 gigabits per port,
> which in full duplex works out to only around 150 Mbit/s in each
> direction. Hence you should make sure any gigabit switch you get, if
> you insist on using gigabit ethernet, is non-blocking across all
> ports. Note this will probably add a zero to the end of the price.
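A back-of-the-envelope Python sketch of the oversubscription arithmetic Ross describes above (the 32-port / 10 gigabit figures are just his illustrative example, not a measurement of any particular switch):

    # 32 gigabit ports sharing a 10 Gbit/s backplane.
    ports = 32
    backplane_gbit = 10.0

    # If every port tries to talk at once, the fabric is the bottleneck.
    per_port_gbit = backplane_gbit / ports          # ~0.3125 Gbit/s per port
    per_direction_mbit = per_port_gbit * 1000 / 2   # split between send and receive
    print(f"per port under full load:   {per_port_gbit:.3f} Gbit/s")
    print(f"per direction, full duplex: {per_direction_mbit:.0f} Mbit/s")  # ~156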
>
> The second problem is that by default most switches either do not
> support flow control or have it turned off. This means that if the
> switch runs out of buffer space because it is overloaded, as Amber
> will easily make it do, you get massive packet loss rather than
> throttling. Each lost packet costs you between 500 and 2000 ms, which,
> as you can imagine, instantly destroys any parallel scaling. You can
> try turning on flow control on the switch if it supports it, although
> the ethernet cards need to support it as well, so you likely need
> server-quality cards that do proper offloading of CPU work. I suspect
> the fact that most cards do not support it is why most switches that
> do support flow control have it turned off by default.
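To see why a few hundred milliseconds per lost packet is so destructive, compare it to the per-step wallclock time of a typical run. The sketch below uses the JAC timing Bob Duke quotes further down this thread (500 steps in 64 s on 4 CPUs) together with the 500-2000 ms retransmit penalty mentioned above:

    # Cost of one TCP retransmit timeout, measured in MD steps of lost work.
    steps = 500
    wallclock_s = 64.0                       # 4-CPU JAC run quoted below
    step_ms = wallclock_s / steps * 1000     # ~128 ms per MD step

    for penalty_ms in (500, 2000):
        print(f"a {penalty_ms} ms stall wastes ~{penalty_ms / step_ms:.0f} steps of work")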
>
> You can check whether it is purely a switch problem by getting a
> crossover cable, hooking two machines up directly and seeing what
> happens. If things get better then it is your switch that is causing
> the problems; if they don't then it could be the cards, or both.
>
> Even with a decent set of cards, a good switch and a completely
> separate network for MPI and NFS traffic, I wouldn't expect much out
> of ethernet. The issue is manifold. Firstly, Amber itself has improved
> tremendously in performance over the years - see the Amber website for
> a great example by Dave Case - so the communication required per
> second by a given processor has increased with each new version of
> Amber. Next, CPUs are significantly faster today and there are more of
> them in a box than there used to be. If you used to have a single
> 1 GHz chip in a box with a single gigabit card, your communication
> bandwidth was 1 gigabit per GHz. Now people have two dual-core 3 GHz
> chips per box, but they still put just one gigabit ethernet card in
> it, so your bandwidth is now around 80 Mbit/s per GHz. Quite a
> difference. The real issue, though, is that with single 1 GHz chips
> the computation speed was such that the interconnect was never maxed
> out, the switch never ran out of buffer space and everything worked
> great. Now a single box easily maxes out the switch and causes a
> catastrophic failure due to packet loss, so things look much worse
> than the 80 Mbit/s per GHz figure alone would suggest.
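The bandwidth-per-GHz comparison works out roughly as follows in Python (purely the illustrative figures from the paragraph above, not measurements of any particular machine):

    # One gigabit NIC per box in both cases.
    nic_mbit = 1000.0

    old_ghz = 1.0 * 1      # single 1 GHz chip
    new_ghz = 3.0 * 4      # two dual-core 3 GHz chips = 4 cores

    print(f"old box: {nic_mbit / old_ghz:.0f} Mbit/s per GHz")   # 1000
    print(f"new box: {nic_mbit / new_ghz:.0f} Mbit/s per GHz")   # ~83, i.e. 'around 80'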
>
> Short answer:
>
> 1) Try flow control if your hardware supports it.
>
> 2) Try putting an extra server-quality NIC in each box and hooking
> them up with crossover cables - this will at least (hopefully) allow
> you to run across two boxes. If you put 2 extra NICs in each box you
> might even be able to get to 3 boxes, although you will have to be
> clever about how you configure the MPI stack.
>
> Beyond this there is not much that you can do short of switching to a
> better interconnect.
>
> I'm not sure what you mean by "with Intel 100 Mbit/s there is normal
> scaling". Are those much, much slower chips? Or perhaps just a better
> switch with flow control?
>
I mean that we have another cluster with Intel processors, and there is
no problem there if I use more than 4 processors. The communication on
that cluster is over 100 Mbit/s ethernet.
> Note the reason Turbomole works okay is that it does far more
> computation per communication event. Hence it doesn't drive the switch
> so hard and doesn't cause it to lose packets through overload.
>
> I hope this helps. I am afraid there is no silver bullet here and the
> laws of physics are against you...
>
Ok, thanks, I will discuss your suggestions with the system administrator.
The university supercomputer also has AMD Opteron processors, with a
better interconnect; I will try to run some tests there to compare.
Thanks to all for the help,
servaas
> Good luck,
>
> Ross
>
> /\
> \/
> |\oss Walker
>
> | HPC Consultant and Staff Scientist |
> | San Diego Supercomputer Center |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> | http://www.rosswalker.co.uk | PGP Key available on request |
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may
> not be read every day, and should not be used for urgent or sensitive
> issues.
>
>
>
> ______________________________________________________________
> From: owner-amber.scripps.edu [mailto:owner-amber.scripps.edu]
> On Behalf Of servaas michielssens
> Sent: Tuesday, December 11, 2007 01:35
> To: amber.scripps.edu
> Subject: Re: AMBER: amber on AMD opteron-250
>
>
>
> More info:
>
> 2 CPUs per node
> gigabit ethernet network connection
> little NFS traffic (but there is some)
> Other programs run normally (e.g. Turbomole, which scales OK up to
> 16 CPUs), but GROMACS, for example, shows the same behaviour as
> Amber (and is slower on the AMDs).
>
> So my main problem is the jump when you go beyond 4 CPUs:
> calculations are faster on 4 CPUs than on 8. Scaling from 2 to 4
> is OK, but beyond 4 CPUs it breaks down. Any suggestions there?
> (With the Intel cluster on 100 Mbit ethernet, scaling is
> normal.)
>
>
> kind regards,
>
> servaas
>
>
>
>
>
>
>
> ----- Original Message -----
> From: Robert Duke
> To: amber.scripps.edu
> Sent: Wednesday, December 05, 2007 10:31 PM
> Subject: Re: AMBER: amber on AMD opteron-250
>
>
> No, it should not be that bad, even for gigabit
> ethernet, presuming this is a more-or-less standard
> PME run. If I run PMEMD 8 on the JAC benchmark (PME,
> NVE simulation, 500 steps, ~23K atoms) on my two
> Intel Xeon 3.2 GHz dual-CPU workstations connected
> with a crossover cable, gigabit ethernet, server
> NICs, I get the following runtimes:
>
> # procs wallclock sec
> 1 186
> 2 113
> 4 64
>
> The 3.2 GHz Xeons and Opterons really have pretty
> similar performance.
>
> So if you look at the 2 --> 4 processor performance,
> it comes pretty close to doubling. The 1 --> 2
> processor performance typically does not for small
> dual-core nodes; this is typically a matter of shared
> cache and other sharing effects, as well as the fact
> that there is a ton of overhead in the parallelization
> code that has maximum impact and minimum benefit at 2
> CPUs (the single-CPU code has none of this - it is
> essentially a separate implementation, optimized for
> the single processor). You don't show single-processor
> performance at all, though. PMEMD 9 performance is
> even better. So you have other things going on.
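Reading Bob's wallclock table above as speedups (again, just arithmetic on the numbers he quotes, nothing newly measured):

    # Speedup and parallel efficiency from the JAC wallclock table above.
    times = {1: 186, 2: 113, 4: 64}   # procs -> seconds
    for procs, secs in times.items():
        speedup = times[1] / secs
        print(f"{procs} proc(s): speedup {speedup:.2f}, efficiency {speedup / procs:.0%}")
    print(f"2 -> 4 procs: {times[2] / times[4]:.2f}x")   # close to doubling, as he says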
> Regards - Bob
> ----- Original Message -----
> From: David LeBard
> To: amber.scripps.edu
> Sent: Wednesday, December 05, 2007 3:29 PM
> Subject: Re: AMBER: amber on AMD opteron-250
>
>
> Hi Servaas,
>
> This is generally due to your network (which
> you did not mention, so I assume we are talking
> about gigabit ethernet) and to the number of
> CPUs per node (which you also neglected to
> specify). However, in my experience with
> dual-CPU Opterons (240s and 248s) and gigabit
> ethernet, these numbers seem about right.
> Unfortunately, for 20k atoms you may only be
> able to get good scaling up to 32 CPUs, and
> only if you have a faster network like
> InfiniBand or Myrinet or the like.
>
> Good luck,
> David LeBard
>
> On 12/5/07, servaas michielssens
> <servaas.michielssens.student.kuleuven.be >
> wrote:
> I ran a 20 ps simulation of a system of
> 20000 atoms on an AMD Opteron 250
> cluster with 8 processors, using
> Amber 8 and PMEMD for the simulation. I
> found some strange results:
> proc time(min)
> 2 31
> 3 29
> 4 20
> 5 23
> 6 24
> 7 20
> 8 21
>
> 4 processors gives the optimum, and it
> seems to be independent of how I
> address the processors: for 5
> processors, 1-2-3-4-5 or 1-2-3-4-7
> gives the same results, and the
> optimum is always at four processors.
> Has anyone else experienced this
> scaling problem?
>
> kind regards,
>
> servaas michielssens
>
>
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
Received on Wed Dec 12 2007 - 06:07:34 PST