RE: AMBER: Sander slower on 16 processors than 8

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 22 Feb 2007 14:15:07 -0800

Dear Steve,
 
To understand what you are seeing takes a deeper look at what a cluster
actually is. Just remember: "all clusters are not created equal".
 
The issue here is that your interconnect is gigabit ethernet, and for
parallel MD this is pretty much useless. Back in the days of single-core
1 GHz processors I was already running over gigabit ethernet, and now we
have 3.8 GHz quad-core processors yet people still use gigabit ethernet.
With four cores sharing one link, the effective communication speed per cpu
is only 250 Mbit/s, and since the cores are almost 4 times faster, in terms
of the ratio of floating-point speed to internode communication speed it is
as if you were now using 62 megabit ethernet :'(. So in real terms we are
not just marching backwards in interconnect performance on typical
clusters, we are positively sprinting in the wrong direction... at least in
terms of scaling. If you look at the ONLY real metric that matters,
wallclock time to solution, then things aren't quite so bad ;-).
 
Anyway, what you are seeing is typical of gigabit ethernet. Firstly, the
latency is awful; secondly, the bandwidth is poor; thirdly, not all
switches are non-blocking; fourthly, people often chain these switches
together in a highly blocking fashion; fifthly, people also route NFS
traffic over the same network as MPI; and finally, everything has to be
wrapped up in TCP/IP packets, which were designed for sending data halfway
around the world over the internet, not along 2 m of cable. In that last
case the poor cpu spends its entire time assembling TCP/IP packets and
unwrapping them again at the other end. So over gigabit ethernet on a
modern cpu, scaling to just 8 cpus is pretty typical, and you are lucky if
you get much beyond that. Really, though, you should be using PMEMD if you
can, as it is designed to scale better than sander (although it has a
smaller feature set); see the PMEMD section of the AMBER 9 manual.
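 
If you want to see exactly what your particular switch and TCP stack are
costing you, a quick ping-pong test between two nodes will show the latency
and bandwidth directly. Here is a minimal sketch using mpi4py and NumPy
(neither is part of AMBER; this is purely a diagnostic, not how sander or
PMEMD communicate internally):

    # Minimal MPI ping-pong between two ranks: small messages expose latency,
    # large ones expose bandwidth. Launch two ranks on two *different* nodes
    # (mpirun syntax varies between MPICH, LAM and others).
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    reps = 100

    for nbytes in (8, 1 << 20):             # 8 B (latency) and 1 MiB (bandwidth)
        buf = np.zeros(nbytes, dtype=np.uint8)
        comm.Barrier()
        t0 = MPI.Wtime()
        for _ in range(reps):
            if rank == 0:
                comm.Send(buf, dest=1, tag=0)
                comm.Recv(buf, source=1, tag=1)
            elif rank == 1:
                comm.Recv(buf, source=0, tag=0)
                comm.Send(buf, dest=0, tag=1)
        rtt = (MPI.Wtime() - t0) / reps     # average round-trip time
        if rank == 0:
            mb_per_s = 2 * nbytes / rtt / 1e6   # 2*nbytes cross the wire per round trip
            print(f"{nbytes:8d} bytes: rtt {rtt * 1e6:9.1f} us, ~{mb_per_s:7.1f} MB/s")

On the small-message round trips in particular, a proper low-latency
interconnect beats gigabit ethernet over TCP by an order of magnitude or
more.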
 
Beyond this, if you want a cluster that does parallel MD well, you really
need to invest in a decent interconnect such as Quadrics, or consider
applying for time on some of the NSF supercomputers, which are typically
built with much, much better communication. For example, PMEMD will happily
scale to 128 cpus on DataStar at SDSC for a 100K-atom PME system, and even
better for a GB system (see http://coffee.sdsc.edu/rcw/amber_sdsc/ and
http://www.sdsc.edu/us/allocations/).
 
Beyond that, if you want to try to get some more out of your cluster, I
would start by putting in a second ethernet network on an independent
switch and making sure all NFS traffic goes over one network and all MPI
traffic over the other. You could experiment with this by temporarily
cannibalising some of the other nodes in your system, assuming the ethernet
controllers aren't physically on the motherboards. I would then make sure
the queuing system for your cluster is set up so that you always get nodes
exclusively to yourself, and ideally that nobody else is running on the
cluster at the same time as you. Next I would experiment with increasing
the network buffer sizes
(http://amber.ch.ic.ac.uk/archive/200408/0202.html), make sure you are
using PMEMD, and finally try one of the systems that route MPI traffic over
ethernet without TCP/IP - there are some implementations around, though
their names slip my mind at the moment. I played with some about 6 years
ago, but back then they were still pretty awful and buggy, at which point I
gave up on ethernet for MPI and switched over to Scali, then Quadrics and
then InfiniBand.
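 
Before touching the buffer sizes it is worth checking what the nodes are
currently set to. Here is a small sketch that just reads the usual Linux
2.6 /proc entries (the tuning itself is done as root with sysctl, as
described in the archived post above):

    # Print the kernel's current TCP/socket buffer limits (values in bytes;
    # tcp_rmem/tcp_wmem report min, default and max).
    paths = [
        "/proc/sys/net/core/rmem_max",
        "/proc/sys/net/core/wmem_max",
        "/proc/sys/net/ipv4/tcp_rmem",
        "/proc/sys/net/ipv4/tcp_wmem",
    ]

    for p in paths:
        try:
            with open(p) as f:
                print(f"{p}: {f.read().strip()}")
        except OSError:
            print(f"{p}: not readable on this kernel")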
 
I'm sorry I can't help much more than this, but basically you are up
against the limits of gigabit ethernet, and it is unlikely that any MD
simulation software will scale well on such an outdated interconnect.
 
Note that if you do replica exchange it is a different story: even over
gigabit ethernet you should be able to run 4 cpus x 32 replicas for 128
cpus, as long as you don't max out the NFS server's capacity for disk I/O.
If you do, then you need to start considering putting in a SAN switch and
some Fibre Channel disk.
 
Good luck,
 
Ross
 
/\
\/
|\oss Walker

| HPC Consultant and Staff Scientist |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.
 



  _____

From: owner-amber.scripps.edu [mailto:owner-amber.scripps.edu] On Behalf Of
Sontum, Steve
Sent: Thursday, February 22, 2007 12:33
To: amber.scripps.edu
Subject: AMBER: Sander slower on 16 processors than 8


I have been trying to get decent scaling for amber calculations on our
cluster and keep running into bottlenecks. Any suggestions would be
appreciated. The following are benchmarks for factor_ix and jac on 1-16
processors, using amber8 compiled with pgi 6.0, except for the lam runs,
which used pgi 6.2.

 

BENCHMARKS (time vs. number of processors)

                            1     2     4     8    16
mpich1 (1.2.7)  factor_ix  928   518   318   240   442
mpich2 (1.0.5)  factor_ix  938   506   262     *
mpich1 (1.2.7)  jac        560   302   161   121   193
mpich2 (1.0.5)  jac        554   294   151   111   181
lam    (7.1.2)  jac        516   264   142   118   259

* timed out after 3 hours
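 
For reference, turning these times into speedup and parallel efficiency
makes the fall-off explicit; a small Python sketch over the mpich1 jac row
above:

    # Speedup = T(1)/T(N) and efficiency = speedup/N, using the mpich1 jac
    # times quoted above.
    times = {1: 560, 2: 302, 4: 161, 8: 121, 16: 193}

    t1 = times[1]
    for n in sorted(times):
        speedup = t1 / times[n]
        print(f"{n:2d} procs: time {times[n]:4d}  "
              f"speedup {speedup:4.2f}x  efficiency {speedup / n:6.1%}")

By 16 processors the jac speedup is below 3x, i.e. under 20% efficiency.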

QUESTIONS

First off, is it unusual for the calculation to get slower as the number of
processes increases?

Does anyone have benchmarks for a similar cluster, so I can tell if there is
a problem with the configuration of our cluster? I would like to be able to
run on more than one or two nodes.

 

SYSTEM CONFIGURATION

The 10 compute nodes use 2.0 GHz dual-core Opteron 270 chips with 4 GB of
memory and 1 MB of cache, Tyan 2881 motherboards, an HP ProCurve 2848
switch, and a single 1 Gb/s ethernet connection to each motherboard. The
master node is configured similarly but also has 2 TB of RAID storage that
is automounted by the compute nodes. We are running SuSE with the
2.6.5-7-276-smp kernel for the operating system. Amber8 and mpich were
compiled with pgi 6.0.

 

I have used ganglia to look at the nodes while a 16-process job is running.
The nodes are almost entirely consumed by system CPU time: the user CPU
time is only 5%, and a node is only pushing 1.4 kBytes/sec out over the
network.
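 
For what it's worth, the same user-vs-system split can be confirmed
directly on a node without ganglia by sampling /proc/stat twice (standard
on a Linux 2.6 kernel); a rough sketch:

    # Sample the aggregate "cpu" line of /proc/stat twice and report how the
    # elapsed jiffies split between user and system time.
    import time

    def sample():
        with open("/proc/stat") as f:
            return [int(x) for x in f.readline().split()[1:]]  # user nice system idle ...

    a = sample()
    time.sleep(5)
    b = sample()

    delta = [y - x for x, y in zip(a, b)]
    total = sum(delta) or 1
    user, nice, system = delta[0], delta[1], delta[2]
    print(f"user {100 * (user + nice) / total:.1f}%   system {100 * system / total:.1f}%")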

 

Steve

------------------------------

Stephen F. Sontum
Professor of Chemistry and Biochemistry
email: sontum.middlebury.edu
phone: 802-443-5445


-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
Received on Sun Feb 25 2007 - 06:07:25 PST