Re: AMBER: amber8 parallel sander

From: Robert Duke <>
Date: Wed, 12 Jan 2005 10:20:40 -0500

Yen -
Once again I will chime in after Carlos :-) Especially with GB ethernet,
"your mileage may vary" as you attempt to scale up to higher processor
count. The performance gain can quickly be inconsequential, and you can
even lose ground due to all the parallel overhead issues. Are you running
sander or pmemd? PMEMD is faster in these scenarios and takes less memory,
but it only does a subset of the things that sander does, and still cannot
overcome all the inadequacies of GB ethernet. So what are the major factors
that determine how well things work? In a nutshell, they are:
1) Problem size. As you have more atoms, it takes more memory on each node
for the program AND for MPI communications. I cannot get sander 8 to run at
all on 4 pentium nodes/GB ethernet/mpich for the rt polymerase problem which
is ~140K atoms. The atoms are constrained which actually makes things worse
in some regards (extra data). It runs fine for sander 8 on 1 and 2
processors. It runs fine for pmemd on 1, 2, and 4 pentium/GB nodes. The
really important problem in all this is the distributed fft transpose for
reciprocal space calcs. That quickly swamps a GB interconnect for large
problem size.
2) SMP contribution. An aspect of pentium GB ethernet setups that is often
not emphasized is that dual pentium cpu's (they share memory) are often
used. Now this is bad from a standpoint of memory cache utilization,
because each processor has less cache to work with, and depending on the
hardware there are cache coherency issues. But from an interconnect
standpoint it is great because mpi running over shared memory is
significantly faster than mpi over GB ethernet. So if you have a collection
of uniprocessors connected via GB ethernet, I would not expect much.
3) Network hardware configuration. Are the network interface cards on the
ultrasparcs server-grade, i.e., capable of running full bandwidth full duplex
GB ethernet without requiring significant cycles from the cpu? If not, then
things won't go as well. Server nic's typically cost more than twice as
much as workstation grade nics. How about the ethernet switch? Cheap ones
DO NOT work well at all, and you will have to pay big bucks for a good one
(folks can chime in with suggestions here; I run two dual pentiums connected
with a XO cable so there is no switch overhead). Cables? Well, think CAT
5e or 6, I believe. This is getting to be more commonly available; just be
sure your cables are rated for 1 Gbit and not 100 mbit/sec.
4) MPI configuration. I don't mess with suns, so I have no idea if they
have their own mpi, or if you are running mpich or whatever. If you are
running mpich, there are ways to screw up the s/w configuration, not
allowing for shared memory use, not giving mpich enough memory to work with,
etc. There are some notes on the amber web site, probably by me, Victor
Hornak, et al.
5) System configuration. For linux, there are networking config issues
controlling how much buffer space is set aside for tcp/ip communications.
This can have a big effect on how fast communication actually is, and
typically you sacrifice system memory for speed. See the amber web page for
notes from me relevant to linux; once again I have no idea what is required
for sun hw/sw.
6) Other network loading issues. Is the GB ethernet used a dedicated GB
ethernet, with no traffic for internet, other machines, NFS, etc., etc.? Is
anyone else using other nodes at the same time (i.e., perhaps several mpi
jobs running over the same GB ethernet)? If there is any other network load
whatsoever, your performance will be worse, and it may be substantially
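
For item 5, a minimal sketch of how one might inspect the current Linux TCP buffer limits before changing anything. It assumes the standard /proc/sys sysctl files are the relevant knobs, which this message does not confirm, and it only reads values; the recommended settings are in the notes on the amber web page. The names SYSCTL_FILES, parse_sysctl and read_tcp_buffers are made up for illustration.

```python
from pathlib import Path

# Standard Linux sysctl files for socket/TCP buffer sizes.
# Assumption: these are the settings meant by "buffer space" in item 5.
SYSCTL_FILES = {
    "net.core.rmem_max": "/proc/sys/net/core/rmem_max",
    "net.core.wmem_max": "/proc/sys/net/core/wmem_max",
    "net.ipv4.tcp_rmem": "/proc/sys/net/ipv4/tcp_rmem",
    "net.ipv4.tcp_wmem": "/proc/sys/net/ipv4/tcp_wmem",
}


def parse_sysctl(text):
    """Turn the whitespace-separated contents of a sysctl file into ints."""
    return [int(field) for field in text.split()]


def read_tcp_buffers(files=SYSCTL_FILES):
    """Return {setting name: list of byte values} for the files that exist."""
    result = {}
    for name, path in files.items():
        p = Path(path)
        if p.is_file():  # silently skipped on non-Linux systems
            result[name] = parse_sysctl(p.read_text())
    return result


for name, values in read_tcp_buffers().items():
    print(name, values)
```

On a machine without these files (for example Solaris) the loop simply prints nothing.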

What is the best you can expect? Well, my latest performance work for pmemd
(not yet released version) yields the following throughput for my two dual
pentiums (3.2 GHz), XO GB-connected, running 90,906 atoms, constant
pressure, particle mesh ewald:

# proc   psec/day

   1        95
   2       155
   4       259
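
To put numbers on the scaling in the table above, here is a small sketch that converts throughput into speedup and parallel efficiency relative to the smallest process count. The function name scaling_report is made up for illustration; the input figures are the psec/day values from the table.

```python
def scaling_report(throughput):
    """Map process count -> (speedup, efficiency) relative to the smallest count.

    throughput: dict of process count -> psec/day.
    """
    base_n = min(throughput)
    base = throughput[base_n]
    report = {}
    for n in sorted(throughput):
        speedup = throughput[n] / base
        efficiency = speedup / (n / base_n)  # 1.0 would be perfect linear scaling
        report[n] = (speedup, efficiency)
    return report


# psec/day for the unreleased pmemd on two dual pentiums (table above)
pmemd_dev = {1: 95, 2: 155, 4: 259}

for n, (speedup, efficiency) in scaling_report(pmemd_dev).items():
    print(f"{n} procs: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

With these figures the efficiency comes out at roughly 82% on 2 processes and 68% on 4.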

Now, you will note that the scaling is not great, and this is about as good
as it gets for this kind of hardware. This IS a large problem (91K atoms),
and you should do significantly better on scaling if your problem is
smaller. By the way, comparison numbers for 1 and 2 procs, this same
hardware, pmemd 8 and sander 8 are:

# proc psec/day, pmemd 8 psec/day, sander8

 1 76
 2 121

Now I don't have any data for 8 and 16 processors here, simply because I no
longer have access to that type of hardware in reasonable condition. A
while back I did runs on 2.4 GB blades at UNC for pmemd 3.0- 3.1, and was
seeing numbers like this (abstracted from pmemd 3.0 release notes, there is
lots of good info about performance on various systems in the various pmemd
release notes, available on the amber web site). NOTE that I had exclusive
access to the blade cluster (no other jobs running on it), and the GB
ethernet was dedicated to mpi, not shared for internet, etc.:

The mpi version was MPICH-1.2.5. Both PMEMD and
Sander 7 were built using the Intel Fortran Compiler.

90906 Atoms, Constant Pressure Molecular Dynamics (Factor IX)
#procs   PMEMD      Sander 6   Sander 7
         psec/day   psec/day   psec/day
   2        85         46         59
   4       153         71         99
   6       215         ND         ND
   8       272        114        154
  10       297         ND         ND
  12       326        122         ND
  14       338         ND         ND
  16       379        127        183
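
One way to read the blade numbers above is to look at how much throughput each additional process buys, and how PMEMD compares with Sander 7 where both were measured. The sketch below does that; the helper names are made up for illustration and ND entries are represented as None.

```python
# psec/day from the table above; None marks an ND entry
pmemd = {2: 85, 4: 153, 6: 215, 8: 272, 10: 297, 12: 326, 14: 338, 16: 379}
sander7 = {2: 59, 4: 99, 6: None, 8: 154, 10: None, 12: None, 14: None, 16: 183}


def marginal_gain(throughput):
    """psec/day gained per extra process between consecutive measurements."""
    counts = sorted(throughput)
    gains = {}
    for prev, cur in zip(counts, counts[1:]):
        gains[cur] = (throughput[cur] - throughput[prev]) / (cur - prev)
    return gains


def ratio(a, b):
    """a/b at every process count where both values were measured."""
    return {n: a[n] / b[n] for n in sorted(a) if a.get(n) and b.get(n)}


for n, gain in marginal_gain(pmemd).items():
    print(f"up to {n:>2} procs: {gain:5.1f} psec/day per added process")

for n, r in ratio(pmemd, sander7).items():
    print(f"{n:>2} procs: PMEMD is {r:.2f}x Sander 7")
```

The per-process gain drops from 34 psec/day (going from 2 to 4) to single or low double digits beyond 8 processes, while the PMEMD advantage over Sander 7 grows from about 1.4x to about 2.1x.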

There are reasons that people spend those large piles of money for
supercomputers. PMEMD runs with pretty close to linear scaling on hardware
with a fast interconnect out to somewhere in the range of 32-64 processes,
and is usable (greater than 50% scaling) at 128 processors and beyond. I
can get close to 7 nsec/day for the problem above on that sort of hardware
(once again, unreleased version of pmemd, but you will see it in the

Regards - Bob Duke

----- Original Message -----
From: yen li
Sent: Wednesday, January 12, 2005 8:54 AM
Subject: Re: AMBER: amber8 parallel sander

Thanks Robert & Carlos for the clarifications.

I have one more related doubt. I also timed the same simulations for the 4
cases: namely 1, 4, 8 & 16 processors. I find that it's the fastest for 4
and slower for 1, 8 & 16. I can understand for 1 but cannot understand it
getting slower for increased number of processors.

All the processors are of the same make (Sun UltraSparc III+), same
OS (Solaris 8), same amount of RAM (1GB each) and connected over 1GBps


Robert Duke <> wrote:
Yen -
As Carlos says, this is expected. The reason is that when you parallelize
the job, the billions of calculations done occur in different orders, and
this introduces different rounding errors. With pmemd, you will even see
differences with the same number of processors, and this is because there is
dynamic load balancing of direct force workload (ie., if one processor is
taking more time to do the direct force calcs, it will be assigned fewer
atoms to work on). You have to remember that the internal numbers used have
way more precision than is justified by our knowledge of parameters, or for
that matter how well the method represents reality, and that any one run
represents one of many possible trajectories.
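
A tiny illustration of the ordering effect, not taken from sander or pmemd: floating-point addition is not associative, so splitting the same sum across "processors" differently can change the last bits of the result. The function names are made up for illustration, and an explicit loop is used instead of the built-in sum(), which may apply compensated summation in recent Python versions.

```python
def naive_sum(values):
    """Add values left to right in plain double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


def partitioned_sum(chunks):
    """Sum each chunk separately (one per 'processor'), then add the partials."""
    return naive_sum([naive_sum(chunk) for chunk in chunks])


values = [0.1, 0.2, 0.3]

serial = naive_sum(values)                        # (0.1 + 0.2) + 0.3
parallel = partitioned_sum([[0.1], [0.2, 0.3]])   # 0.1 + (0.2 + 0.3)

print(repr(serial), repr(parallel), serial == parallel)
```

The two results differ only in the last bit here, but in an MD run such differences feed back into the next step and grow over billions of operations.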
Regards - Bob Duke
----- Original Message -----
From: yen li
Sent: Wednesday, January 12, 2005 5:57 AM
Subject: AMBER: amber8 parallel sander

Hello amber
I am testing the parallel version of amber8. I am running an md simulation
over a small protein.
I am testing the calculations on four, eight and sixteen processors. My
problem is that, initially the energy values in the output files are the
same, but as the simulation proceeds, the values start to diverge making the
differences large. Is this kind of behaviour ok or do I need to take care of
some parameters?


The AMBER Mail Reflector
To post, send mail to
To unsubscribe, send "unsubscribe amber" to
Received on Wed Jan 12 2005 - 15:53:00 PST