Re: AMBER: PMEMD and sander from AMBER6 performances from Robert Duke on 2003-08-07 (Amber Archive Aug 2003)

From: Robert Duke <rduke.email.unc.edu>
Date: Thu, 7 Aug 2003 15:01:39 -0400

Teletchea -

I think you are actually closer than you realize; in fact you are getting
better performance relative to sander 6 than I am, and that is going to be
the most reproducible number. I did get one thing wrong in my last mail to
you, for which I apologize. I stated that pmemd has a skinnb default of
1.5, but the value is actually 1.0. I don't know where I got 1.5 from, but
it pops into my head every now and then. Both the sander 6 and pmemd code
use a 1.0 angstrom default for skinnb, and the sander 6 doc states the same
value. This will not make a big difference in your results. If you took
skinnb out of the mdin, you should have gotten the 1.0 default anyway. If
you set it to 1.5, pmemd is running at about a 2% penalty: list building
takes about 10% of the time, and the larger skin means 9.5**3 / 9.0**3 ==
1.176 times as much overhead in list building.
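As a rough check of the penalty estimate above, here is a minimal Python sketch. It assumes, as the estimate does, that list-building cost scales with the volume of the list sphere and that list building is about 10% of run time. The function name and its arguments are made up for illustration and are not part of pmemd or sander.

```python
def list_build_penalty(radius_new, radius_old, list_fraction):
    """Return (volume_ratio, penalty) for a larger pairlist radius.

    volume_ratio: how much more list-building work the larger radius
                  implies, assuming work scales with sphere volume.
    penalty:      estimated fractional slowdown of the whole run, given
                  the fraction of run time spent building lists.
    """
    volume_ratio = (radius_new / radius_old) ** 3
    penalty = list_fraction * (volume_ratio - 1.0)
    return volume_ratio, penalty


# Radii of 9.5 and 9.0 angstroms and a 10% list-building share,
# as quoted in the message.
ratio, penalty = list_build_penalty(9.5, 9.0, 0.10)
print(f"list-building work ratio: {ratio:.3f}")
print(f"estimated overall penalty: {penalty:.1%}")
```

The result comes out a little under 2%, which is where the "about a 2% penalty" figure comes from.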

I will comment just on the xeon numbers, since they are the ones to which I
can most easily compare my benchmarks. My athlon runs were with machines
clustered via Myrinet which should scale slightly differently. First of all
I would note that your 2.8 GHz xeons seem to run about as fast as my 2.4 GHz
xeons. The 2 processor psec/day for sander 6 is 131.1 on your machines and
130 on mine. At two processors, communication on my systems is via shared
memory between dual processors; I presume the same is true for your
machines, so the interconnect has not come into play. The 4 processor
psec/day for sander 6 is 198.2 on your machines and 208 on mine. You are
scaling about 5% worse on sander 6, which is close. I don't have 6
processor data for sander 6, so we can't compare there. For pmemd at two
processors, you get 261 psec/day and I get 234, so you outperform me by
around 11-12%. At four processors, you get 421.5 and I get 408, so the gap
has narrowed to around 3%. At six processors, you get 572.2 and I get 584,
so I am now around 2% faster.
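The comparisons above can be reproduced with a few lines of Python. This is only a sketch using the psec/day figures quoted in this message; the dictionary layout and function names are made up for illustration.

```python
# psec/day on the xeon runs, keyed by (program, processors).
# Each value is (your figure, my figure), as quoted in this message.
RESULTS = {
    ("sander 6", 2): (131.1, 130.0),
    ("sander 6", 4): (198.2, 208.0),
    ("pmemd", 2): (261.0, 234.0),
    ("pmemd", 4): (421.5, 408.0),
    ("pmemd", 6): (572.2, 584.0),
}


def percent_difference(yours, mine):
    """Positive when your machines are faster, negative when mine are."""
    return (yours / mine - 1.0) * 100.0


def pmemd_speedup(nproc, who):
    """pmemd throughput divided by sander 6 throughput (who: 0=you, 1=me)."""
    return RESULTS[("pmemd", nproc)][who] / RESULTS[("sander 6", nproc)][who]


for (program, nproc), (yours, mine) in RESULTS.items():
    diff = percent_difference(yours, mine)
    print(f"{program:8s} {nproc} procs: {yours:6.1f} vs {mine:6.1f} ({diff:+.1f}%)")

# Only 2 and 4 processors have sander 6 data on both sides.
for nproc in (2, 4):
    print(f"pmemd/sander 6 at {nproc} procs: "
          f"you {pmemd_speedup(nproc, 0):.2f}x, me {pmemd_speedup(nproc, 1):.2f}x")
```

The second loop shows the point made at the top of this mail: the pmemd speedup relative to sander 6 is larger on your machines than on mine at both processor counts.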

My guesses on all this:
1) Your systems are not 2.8/2.4 times faster than mine; there are other
issues inside the box that make the ratio closer to 1:1.
2) Your interconnect is maybe 10-20% less efficient. This may be due to a
variety of things you can't easily fix. I asked our systems folks what they
did to improve system performance on our machines, and they basically did
not have a clue. I found one suggestion that may help, though I have not
been able to try it on gigabit ethernet. According to the MP_Lite doc
(MP_Lite does not work out of the box for pmemd or sander, so I have not
actually used it), you can get better performance on Linux systems by
increasing the TCP buffer sizes on all nodes: as root, write larger integer
values to /proc/sys/net/core/rmem_max and /proc/sys/net/core/wmem_max, up
to 4194304, though I would go up in increments. No guarantees, but I did
not destroy my test node by doing it. It did not help one PC, which
apparently has a really slow NIC connected to 100 mbit/sec ethernet; I
never see anything near 100 mbit/sec out of this machine, so while it
screams on mflops, it is a dog on i/o.
3) If you really did specify a skinnb of 1.5 (sorry again!), take it out and
just get the default by specifying nothing. You may then see slightly
better scaling. Increasing the skinnb value not only increases the size of
the nonbonded pairs list, it also increases the overlap in shared
coordinates and forces. One of the big pmemd scalability advantages is that
it does not share all forces, coords, and velocities between nodes.
Instead, different atoms are "owned" or "used". For "owned" atoms, forces
are communicated back to the owner. For "used" atoms, coordinates are
distributed to the users at every step (and all coordinates are exchanged
when the nonbonded pairlist needs building). Finally, velocities are only
reported back to the master for output. The effect of increasing the skinnb
value is to increase the number of atoms "used" by the various tasks, which
increases communications overhead and decreases scalability. However,
making the skin too small increases list build frequency, and each list
build requires all coords to be communicated, so it is a balancing act.
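For the buffer-size suggestion in point 2, here is a minimal Python sketch of going up in increments. The doubling schedule is an assumption (the MP_Lite suggestion only gives the 4194304 upper limit), and the helper names are made up for illustration. Writing to the real /proc files needs root, so the sketch only defines the helpers and changes nothing when loaded.

```python
from pathlib import Path

MAX_BUFFER = 4194304                    # upper limit quoted above
CORE_DIR = Path("/proc/sys/net/core")   # holds rmem_max and wmem_max


def buffer_steps(current, limit=MAX_BUFFER):
    """Doubling steps from the current value up to the limit (inclusive).

    Doubling is an assumed schedule; any increasing sequence would do.
    """
    if current <= 0:
        raise ValueError("current buffer size must be positive")
    steps = []
    value = current
    while value < limit:
        value = min(value * 2, limit)
        steps.append(value)
    return steps


def read_limit(name, core_dir=CORE_DIR):
    """Read the current setting, e.g. name='rmem_max'."""
    return int((Path(core_dir) / name).read_text().strip())


def write_limit(name, value, core_dir=CORE_DIR):
    """Write a new setting; needs root when core_dir is the real /proc path."""
    (Path(core_dir) / name).write_text(f"{value}\n")
```

One way to use it, as root on each node, would be to call `buffer_steps(read_limit("rmem_max"))`, apply one step at a time to both rmem_max and wmem_max with `write_limit`, and rerun the benchmark after each step to see whether the scaling improves.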

Hope this helps; I think you are really pretty close.

On other PMEMD issues, we are currently porting it to the machines at PSC.
We have it running on lemieux, the AlphaServer ES45/Quadrics terascale
computing facility (3000 nodes, very nice!), and on lemieux we are observing
a 4-fold speedup relative to sander 7 for the 90906 atom constant pressure
problem. Pretty nice. We have it running on the Cray T3E there also,
though the speedup is not as impressive (8 byte integers, less memory).

Regards - Bob

----- Original Message -----
From: "Teletchéa Stéphane" <steletch.biomedicale.univ-paris5.fr>
To: <amber.scripps.edu>
Cc: "Robert Duke" <rduke.email.unc.edu>
Sent: Thursday, August 07, 2003 12:55 PM
Subject: Re: AMBER: PMEMD and sander from AMBER6 performances

Received on Thu Aug 07 2003 - 20:53:00 PDT