Re: AMBER: amber8 parallel sander

From: Robert Duke <>
Date: Mon, 17 Jan 2005 13:13:10 -0500

Yen -
Some wild guesses here. Less than 100% cpu utilization indicates that you DO have network problems (you are blocked waiting on the network instead of doing calcs). Also, seeing high utilization in poll() probably indicates that you are wasting a lot of your cpu time spin-locking in the poll, waiting for network communications to occur. A good diagnostic is the time data in the logfile or mpi_profile file. This will give you some indication of how much time is being spent in communications. I would GUESS that you have one or all of the following going on:
1) your system tcp/ip buffer size is small,
2) the network cards are slow, and possibly not on a fast system bus,
3) the ethernet switch is not operating at full duplex, and/or is basically slow.
Just because you don't appear to be pushing the net to a full Gbit/sec (actually with full duplex, you push it this fast both ways) does not mean that you don't have a net problem. The problem with your hardware and how the system is configured may well be that they are not truly capable of attaining 2 Gbit/sec throughput (ie., both ways), or they may just need different configs. IF you are seeing 100% cpu utilization, and it is not in poll() and other network calls, and the metric for FRC_collect time in profile_mpi (or FXdist in the logfile of pmemd) is small, then your network is doing okay.
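To make that diagnostic concrete, here is a minimal sketch of the arithmetic: take the total wall time and the communication time (e.g. the FRC_collect or FXdist figure) from the timing output and compute the fraction spent in communication. The numbers below are illustrative placeholders, not from a real run, and the 20% threshold is just a rough rule of thumb.

```python
# Rough communication-overhead check from sander/pmemd timing output.
# The sample values are hypothetical, not from a real logfile.

def comm_fraction(total_time: float, comm_time: float) -> float:
    """Fraction of total wall time spent in MPI communication."""
    return comm_time / total_time

# Suppose the logfile reports 1000 s total and 350 s in force
# distribution/collection (FXdist / FRC_collect):
frac = comm_fraction(1000.0, 350.0)
print(f"communication fraction: {frac:.0%}")
if frac > 0.2:  # illustrative threshold, not an official cutoff
    print("likely network-bound; check nics, switch, tcp buffers")
```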
Regards - Bob
  ----- Original Message -----
  From: yen li
  Sent: Monday, January 17, 2005 12:55 PM
  Subject: Re: AMBER: amber8 parallel sander

  Hi Robert & Carlos,
  Thanks for the elaborate replies.

  It's true that we are sharing the switched GB ethernet for other
  purposes also. To check whether the network is the choke point, we collected
  network statistics while the benchmark was running on 16 processors.
  We find that the network utilization is never more than 10% for any of
  the hosts, and the link is up at 1Gb/s. While doing this we noticed
  unusually high system usage, like 50%+. To find out the cause, we
  collected the system calls being generated by one of the 16 processes
  ("$> truss -p pid"). The results show that it's mostly the system call
  "poll", which returns 0 (like 80%+ of the time) and error EAGAIN (like 15%
  of the time).
  For linux the equivalent command to truss would be "$> strace -p pid".
  Can someone please suggest any way to improve the performance?
  Best Regards,

> Robert Duke <> wrote:
> Yen -
> Once again I will chime in after Carlos :-) Especially with GB ethernet,
> "your mileage may vary" as you attempt to scale up to higher processor
> count. The performance gain can quickly be inconsequential, and you may
> even lose ground due to all the parallel overhead issues. Are you running
> sander or pmemd? PMEMD is faster in these scenarios and takes less memory,
> but it only does a subset of the things that sander does, and still cannot
> overcome all the inadequacies of GB ethernet. So what are the major factors
> that determine how well things work? In a nutshell, they are:
> 1) Problem size. As you have more atoms, it takes more memory on each node
> for the program AND for MPI communications. I cannot get sander 8 to run at
> all on 4 pentium nodes/GB ethernet/mpich for the rt polymerase problem,
> which is ~140K atoms. The atoms are constrained, which actually makes
> things worse in some regards (extra data). It runs fine for sander 8 on 1
> and 2 processors. It runs fine for pmemd on 1, 2, and 4 pentium/GB nodes.
> A really important problem in all this is the distributed fft transpose in
> the reciprocal space calcs. That quickly swamps a GB interconnect at this
> problem size.
> 2) SMP contribution. An aspect of pentium GB ethernet setups that is often
> not emphasized is that dual pentium cpu's (they share memory) are typically
> used. Now this is bad from a standpoint of memory cache utilization,
> because each processor has less cache to work with, and depending on the
> hardware there are cache coherency issues. But from an interconnect
> standpoint it is great, because mpi running over shared memory is
> significantly faster than mpi over GB ethernet. So if you have a bunch
> of uniprocessors connected via GB ethernet, I would not expect much.
> 3) Network hardware configuration. Are the network interface cards on the
> ultrasparc's server-grade; ie. capable of running full-bandwidth,
> full-duplex GB ethernet without requiring significant cycles from the cpu?
> If not, then things won't go as well. Server nic's typically cost more
> than twice as much as workstation-grade nics. How about the ethernet
> switch? Cheap ones DO NOT work well at all, and you will have to pay big
> bucks for a good one (folks can chime in with suggestions here; I run two
> dual pentiums connected with a XO cable so there is no switch overhead).
> Cables? Well, think cat 5e or 6, I believe. This is getting to be more
> commonly available; just be sure your cables are rated for 1 Gbit and not
> 100 mbit/sec.
> 4) MPI configuration. I don't mess with suns, so I have no idea if they
> have their own mpi, or if you are running mpich or whatever. If you are
> running mpich, there are ways to screw up the s/w configuration, not
> allowing for shared memory use, not giving mpich enough memory to work
> with, etc. There are some notes on the amber web site, probably by me,
> Hornak, et al.
> 5) System configuration. For linux, there are networking config parameters
> controlling how much buffer space is set aside for tcp/ip communications.
> This can have a big effect on how fast communications actually are, and
> typically you sacrifice system memory for speed. See the amber web page
> for notes from me relevant to linux; once again I have no idea what is
> available for sun hw/sw.
> 6) Other network loading issues. Is the GB ethernet used a dedicated mpi
> ethernet, with no traffic for internet, other machines, NFS, etc., etc.?
> Is anyone else using other nodes at the same time (ie., perhaps several
> jobs running over the same GB ethernet)? If there is any other network
> load whatsoever, your performance will be worse, and it may be considerably
> worse.
> What is the best you can expect? Well, my latest performance work for pmemd
> (not yet released version) yields the following throughput for my two dual
> pentiums (3.2 GHz), XO GB-connected, running 90,906 atoms, constant
> pressure, particle mesh ewald:
>
> # proc  psec/day
>    1       95
>    2      155
>    4      259
> Now, you will note that the scaling is not great, and this is about as good
> as it gets for this kind of hardware. This IS a large problem (91K atoms),
> and you should do significantly better on scaling if your problem is
> smaller. By the way, comparison numbers for 1 and 2 procs, this same
> hardware, pmemd 8 and sander 8 are:
>
> # proc  psec/day, pmemd 8  psec/day, sander 8
>    1          76                 54.5
>    2         121                 88
> Now I don't have any data for 8 and 16 processors here, simply because I no
> longer have access to that type of hardware in reasonable condition. A
> while back I did runs on 2.4 GHz blades at UNC for pmemd 3.0-3.1, and was
> seeing numbers like this (abstracted from pmemd 3.0 release notes; there is
> lots of good info about performance on various systems in the various
> release notes, available on the amber web site). NOTE that I had exclusive
> access to the blade cluster (no other jobs running on it), and the GB
> ethernet was dedicated to mpi, not shared for internet, etc.:
> The mpi version was MPICH-1.2.5. Both PMEMD and
> Sander 7 were built using the Intel Fortran Compiler.
> 90906 Atoms, Constant Pressure Molecular Dynamics (Factor IX)
>
> #procs   PMEMD     Sander 6  Sander 7
>          psec/day  psec/day  psec/day
>    2        85        46        59
>    4       153        71        99
>    6       215        ND        ND
>    8       272       114       154
>   10       297        ND        ND
>   12       326       122        ND
>   14       338        ND        ND
>   16       379       127       183
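For what it's worth, the scaling in the blade numbers above works out as follows (a quick sketch; the baseline is the 2-processor row, since no 1-processor figure is given for that cluster):

```python
# Speedup and parallel efficiency for the PMEMD column of the
# blade-cluster table (psec/day), relative to the 2-processor run.
rates = {2: 85, 4: 153, 8: 272, 16: 379}

base_procs, base_rate = 2, rates[2]
for procs, rate in sorted(rates.items()):
    speedup = rate / base_rate
    efficiency = speedup / (procs / base_procs)
    print(f"{procs:2d} procs: speedup {speedup:4.2f}x, efficiency {efficiency:.0%}")
```

At 16 processors this comes out to roughly 4.5x over 2 processors, i.e. about 56% efficiency, consistent with the "greater than 50% scaling" remark below.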
> There are reasons that people spend those large piles of money for real
> supercomputers. PMEMD runs with pretty close to linear scaling on machines
> with a fast interconnect out to somewhere in the range of 32-64 processors,
> and is usable (greater than 50% scaling) at 128 processors and beyond. I
> can get close to 7 nsec/day for the problem above on that sort of hardware
> (once again, unreleased version of pmemd, but you will see it in the near
> future).
> Regards - Bob Duke
> ----- Original Message -----
> From: yen li
> To:
> Sent: Wednesday, January 12, 2005 8:54 AM
> Subject: Re: AMBER: amber8 parallel sander
> Hi,
> Thanks Robert & Carlos for the clarifications.
> I have one more related doubt. I also timed the same simulations for the 4
> cases: namely 1, 4, 8 & 16 processors. I find that it's the fastest for 4
> and slower for 1, 8 & 16. I can understand for 1, but cannot understand it
> getting slower for an increased number of processors.
> All the processors are of the same make (Sun UltraSparc III+), same
> OS (Solaris 8), same amount of RAM (1GB each), and connected over a 1Gbps
> network.
> Thanks
> Robert Duke wrote:
> Yen -
> As Carlos says, this is expected. The reason is that when you rerun
> the job, the billions of calculations done occur in different orders, and
> this introduces different rounding errors. With pmemd, you will even see
> differences with the same number of processors, and this is because there
> is dynamic load balancing of direct force workload (ie., if one processor
> is taking more time to do the direct force calcs, it will be assigned
> fewer atoms to work on). You have to remember that the internal numbers
> used have way more precision than is justified by our knowledge of
> parameters, or for that matter how well the method represents reality, and
> that any one run represents one of many possible trajectories.
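The rounding-error point can be seen in two lines of code: floating-point addition is not associative, so summing the same contributions in a different order yields a slightly different total, and over billions of force accumulations those differences compound into diverging trajectories.

```python
# Floating-point addition is not associative: the same three numbers
# summed in a different order give different results.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)   # the two sums differ in the last bits
print(left, right)
```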
> Regards - Bob Duke
> ----- Original Message -----
> From: yen li
> To:
> Sent: Wednesday, January 12, 2005 5:57 AM
> Subject: AMBER: amber8 parallel sander
> Hello amber,
> I am testing the parallel version of amber8. I am running an md simulation
> over a small protein.
> I am testing the calculations on four, eight and sixteen processors. The
> problem is that, initially the energy values in the output files are the
> same, but as the simulation proceeds, the values start to diverge, making
> the differences large. Is this kind of behaviour ok or do i need to take
> care of some parameters?
> Thanks


The AMBER Mail Reflector
To post, send mail to
To unsubscribe, send "unsubscribe amber" to
Received on Mon Jan 17 2005 - 18:53:01 PST