AMBER: Improving PMEMD performance on gigabit ethernet Linux clusters

From: Robert Duke <rduke.email.unc.edu>
Date: Mon, 16 Aug 2004 17:17:45 -0400

Folks -
I earlier promised to post some info on configuring mpich for better
performance when running pmemd (or for that matter, sander) jobs. I don't
know if everyone already knows this stuff; if so I apologize. However, I
found the MPICH manual to be a bit confusing, and not entirely correct or
complete, and had to do a fair bit of testing to get a simple gigabit
ethernet system to run pmemd with very low net latency.

The system I use is simple and relatively cheap ($8K US) -
two 3.2 GHz dual-Xeon PCs running a current release of Red Hat Linux, with a
dedicated connection via a category 6 crossover cable and two ethernet cards
based on the Intel 82545 chip (my specific cards are the Intel PRO/1000 MT
Server Adapter - you pay more for the server cards, but they are not
outrageous, and they offload work from the CPU - thanks to Dave Konerding for
the original suggestion). I don't need an ethernet switch with only 2
machines, but if you have more machines, you should get a really good switch,
or it will be the bottleneck (others with vast experience on this issue,
please specify a list of good choices).

Configuring the OS and MPI

Okay, the way to get performance, at least in terms of reducing net latency,
is to increase the socket buffer size. You actually need to do this at two
levels:

1) At the OS level, you need to zap a couple of values that determine the
upper limit allowed for socket buffers. This is, in my opinion, best done
by adding the following two lines to the system file /etc/rc.d/rc.local (as
root, of course), and rebooting:

echo 1048576 > /proc/sys/net/core/rmem_max
echo 1048576 > /proc/sys/net/core/wmem_max

This sets the upper limit on socket buffers to 1 MB.
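
If you would rather not edit rc.local, the same limits can also be set with
sysctl; a sketch of the equivalent (not what I actually did here):

sysctl -w net.core.rmem_max=1048576
sysctl -w net.core.wmem_max=1048576

To make that persist across reboots, put the corresponding lines in
/etc/sysctl.conf:

net.core.rmem_max = 1048576
net.core.wmem_max = 1048576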

2) At the level of MPI, you need to make the following entry in your .cshrc
(or the equivalent command in .bashrc, if you use bash):

setenv P4_SOCKBUFSIZE 524288

This is the only way I have found, despite the other documented ways, to get
mpich to use a bigger socket buffer, and as far as I know it only applies to
the ch_p4 device. Here I am using half the maximum allowed by the
rmem_max/wmem_max settings; I set the system to a larger maximum just in case
I want to bump this value up without hassles sometime.
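
For the .bashrc crowd, the equivalent of the setenv line above is simply:

export P4_SOCKBUFSIZE=524288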

For big systems, you may also need to bump up P4_GLOBMEMSIZE; mine is at
838860. If this value is too low, your run will fail, but there will be a
helpful error message. As far as I know, this setting only affects whether
initialization succeeds; it does not affect performance.
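
The setting goes into .cshrc the same way; using my value:

setenv P4_GLOBMEMSIZE 838860

(or export P4_GLOBMEMSIZE=838860 if you use bash).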

Okay, what do you get for your efforts? For the Factor IX ~91K atom problem
running on 4 processors, performance improves by about 29%. This occurs
because net latency drops from 27% of the wall-clock time to 5%. Worthwhile,
in my opinion. Overall, using the Factor IX problem above, you can get 238
psec/day out of 4 processors costing you ~$8K, so that is not bad.

Specific data for various P4_SOCKBUFSIZE settings follows. The test is
Factor IX, constant pressure PME, 8 angstrom cutoff, 250 steps, 0.0015 psec
per step, on four 3.2 GHz Xeon processors.

P4_SOCKBUFSIZE (bytes)   CPU time (s)   Wall clock time (s)

default (16 Kbytes?)        127.55          175   (note the latency)
  16384                     127.52          174
  32768                     128.04          155
  65536                     126.86          144
 131072                     128.46          138
 262144                     128.78          137
 524288                     129.13          136
1048576                     128.70          135
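
To translate the wall-clock numbers into throughput: 250 steps x 0.0015
psec/step is 0.375 psec of simulation per run, so the 136 second wall clock
time at the 524288 setting works out to 0.375 / 136 x 86400 = ~238 psec/day,
which is where the figure above comes from.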


Other things to watch out for:

Another source of potential misery for the user of a small setup revolves
around setting up the process group file(s). The only way I have succeeded
in getting processes running in the right places when I have a dedicated
connection (i.e., I am using MPI over something other than the system's main
net card) is to edit a pgfile and point to it in the mpirun command. Thus,
my default pgfile looks like:

tiger_hs 1 /work/exe/pmemd
lion_hs 2 /work/exe/pmemd

and I start jobs using "mpirun -p4pg ~/pgfile /work/exe/pmemd <pmemd args>"


What the pgfile above does is specify the correct NIC (the *_hs entries in my
hosts file) and the number of processes per system. I start jobs from tiger,
so the process count on that first line is decremented by one (the pgfile
format somewhat insanely specifies the number of ADDITIONAL processes to
start on the machine you launch from - geez). So this pgfile will start 4
processes in all (see the sketch below for how this scales to more machines).
If you are not getting the performance you expect, look at the pmemd logfile
output. If the CPU utilization is uneven by more than 5% or so, or if the
number of processes is not what you anticipated, something is wrong with
where and how mpich is running your jobs.
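
To make the convention concrete (the third host name below is made up), a
pgfile for three dual-processor machines, still launching from tiger, would
look like:

tiger_hs 1 /work/exe/pmemd
lion_hs 2 /work/exe/pmemd
bear_hs 2 /work/exe/pmemd

which runs six processes in all: the one mpirun starts on tiger, one
additional process on tiger, and two each on lion and bear.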

I PRESUME this sort of thing is not necessary if you are not using dedicated
NICs that differ from the NICs the hostnames point at. Then the
machines.LINUX file under mpich/share could probably just contain:

tiger:2
lion:2

and everything would be great. This absolutely does not cause the right
things to happen for an mpirun -np 4 on my systems, though.

This stuff is way more complicated than it should be. Anyone with influence
with the mpich folks should maybe point that out ;-) Most folks hopefully
get insulated from all this junk by their sys admins.

Regards - Bob Duke



-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu