Re: AMBER: Re: Strange problems with PMEMD on Intel Xeons with Infiniband from Robert Duke on 2006-07-06 (Amber Archive Jul 2006)

From: Robert Duke <rduke.email.unc.edu>
Date: Thu, 6 Jul 2006 08:43:24 -0400

Ah, this stimulates another thought. Using -DSLOW_NONBLOCKING_MPI causes
fewer simultaneous operations in mpi and thus requires less mpi buffer
space; the probability of a large number of mpi requests coming into the mpi
layer is reduced. So one thing that can happen, at least on mpich systems,
is a deadlock on buffers in the mpi layer. Whether this happens depends on
how you set up the system and mpi buffer space, but allowing really large
reads/writes over the interconnect makes it more likely, since they require
more system buffer. So there is a tradeoff between making these buffers big
for improved efficiency and risking a deadlock. I see this as a bit of a flaw in the
mpi layer; others (like the mpi guys) would perhaps like to blame the guys
they sit on top of... I discussed this problem in the context of large
simulations running on gigabit ethernet linux clusters a while back on the
reflector, having managed to blunder around and get my own systems to hang.
Sorry I did not think of this first; I did not think about the fact that
this problem could be worse if you don't use -DSLOW_NONBLOCKING_MPI. Rather
than reconstruct what I said then, I will just copy it below, complete with
a bit of extraneous info related to the earlier problem. Perhaps your
system guys could tweak settings a bit to help, but just
using -DSLOW_NONBLOCKING_MPI seems to be working, so maybe nothing has to be
done.
Thanks for the update!
Best Regards - Bob Duke

==> BEGIN EMBEDDED OLD MESSAGE:

(Note that the description here is for dual cpu nodes communicating to at
most one other process at a time (SLOW_NONBLOCKING_MPI). Thus, if you don't
define SLOW_NONBLOCKING_MPI, you would probably want rmem_max/wmem_max to be
a LOT bigger than MPICH_SOCKET_BUFFER_SIZE (or whatever the equivalent is
for your infiniband installation; I presume you are using mvapich?). Each
process needs to be able to simultaneously post a read and a write to every
other process it simultaneously communicates with, and you probably have a
dual cpu setup so multiply by 2; then add a bit more to allow for the fact
that the system might do things over the interconnect that you are unaware
of. Sorry if this is a bit vague; these are system configuration issues
really.)

Folks -
A wild guess here. Sometimes with mpich there can be problems with
deadlocks if too much memory is used, given the mpich configuration, and
this will look like an infinite loop, but what it is is a hang on a deadlock
in the kernel over buffer space. I think this problem is worse for sander
than pmemd because it uses more mpi memory, but it can happen to either.
The critical interplay is between P4_SOCKBUFSIZE (mpich environment variable
under mpich 1.2.x, I think it is MPICH_SOCKET_BUFFER_SIZE in mpich 2) and
kernel networking memory params. What I do is:

In /etc/rc.d/rc.local put the two lines:

echo 1048576 > /proc/sys/net/core/rmem_max
echo 1048576 > /proc/sys/net/core/wmem_max

This way, every time you reboot there is a substantial chunk of memory
dedicated to net buffers. Doing this of course requires root privileges.
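
For what it is worth, the current limits can be read back without root, and
sysctl is an equivalent way to set them. The following is only a sketch: the
helper name check_net_buf and its optional directory argument are
illustrative and not part of any existing tool.

```shell
#!/bin/sh
# check_net_buf MIN [DIR]
# Succeeds only if both rmem_max and wmem_max under DIR are at least MIN bytes.
# DIR defaults to /proc/sys/net/core; it is a parameter only so the check can
# be tried against scratch files.
check_net_buf() {
    min=$1; dir=${2:-/proc/sys/net/core}
    for f in rmem_max wmem_max; do
        val=$(cat "$dir/$f") || return 2
        if [ "$val" -lt "$min" ]; then
            echo "$f=$val is below $min"
            return 1
        fi
    done
    echo "rmem_max and wmem_max are at least $min"
}

# Equivalent to the two echo lines above (needs root):
#   sysctl -w net.core.rmem_max=1048576
#   sysctl -w net.core.wmem_max=1048576
```

Running "check_net_buf 1048576" after a reboot is a quick way to confirm that
the rc.local lines actually took effect.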

Then set P4_SOCKBUFSIZE (MPICH_SOCKET_BUFFER_SIZE for mpich 2) to something
like 131072 in your .cshrc or wherever makes sense for you.

The critical point here is that you need sufficient memory set aside that a
read and write operation can be underway simultaneously in each mpi process,
or things will deadlock, and when you run mpich on dual processor machines,
the amount of net buffer space needed increases (so you see above I am
specifying 8 x as much memory in the kernel as in P4_SOCKBUFSIZE; I don't
know what the minimum "overage" required to prevent deadlocks is, but this
config works well for my machines).
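
The sizing reasoning here, together with the note at the start of this
embedded message, can be written out as arithmetic. This is a rule-of-thumb
sketch only: the function name estimate_net_buf and the MARGIN multiplier are
assumptions made for illustration, and the result is a starting point for
experiment, not a statement of how the kernel accounts for buffer space.

```shell
#!/bin/sh
# estimate_net_buf SOCKBUF PEERS PROCS_PER_NODE MARGIN
#   SOCKBUF         per-socket MPI buffer size in bytes (e.g. P4_SOCKBUFSIZE)
#   PEERS           processes each process talks to at the same time
#   PROCS_PER_NODE  MPI processes per node (2 for dual cpu nodes)
#   MARGIN          extra multiplier for traffic you are unaware of (assumption)
estimate_net_buf() {
    sockbuf=$1; peers=$2; procs=$3; margin=$4
    # one read plus one write per peer gives the factor of 2
    echo $(( sockbuf * 2 * peers * procs * margin ))
}

# SLOW_NONBLOCKING_MPI case: one peer at a time, dual cpu nodes, 2x margin
estimate_net_buf 131072 1 2 2    # prints 1048576
# Without SLOW_NONBLOCKING_MPI, 8 processes: up to 7 peers at once (upper bound)
estimate_net_buf 131072 7 2 2    # prints 7340032
```

The margin of 2 is chosen only because it reproduces the 8 x ratio used above;
the minimum overage that avoids deadlock is not known.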

Now, with mpich you will also need a very large value for P4_GLOBMEMSIZE; I
set my machines to something like 134217728 to be able to run the rt
benchmark on sander; pmemd requires a fraction of this. The run always dies
with an obvious error message when this is a problem.
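
As a sketch of how the two mpich 1.2.x variables mentioned above might be set
in a Bourne-style shell startup file (the csh form for .cshrc is given in the
comments). The values are the ones quoted in this message; whether they suit
another machine is an open question.

```shell
#!/bin/sh
# csh/tcsh equivalent for .cshrc:
#   setenv P4_SOCKBUFSIZE 131072
#   setenv P4_GLOBMEMSIZE 134217728
P4_SOCKBUFSIZE=$(( 128 * 1024 ))          # 131072 bytes = 128 KiB
P4_GLOBMEMSIZE=$(( 128 * 1024 * 1024 ))   # 134217728 bytes = 128 MiB
export P4_SOCKBUFSIZE P4_GLOBMEMSIZE

echo "P4_SOCKBUFSIZE=$P4_SOCKBUFSIZE P4_GLOBMEMSIZE=$P4_GLOBMEMSIZE"
```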

Another point: These large buffer sizes DO improve mpich/gigabit ethernet
performance significantly. There are also issues about being sure the right
number of processors start on the right machines, and that your server nics
(you did buy expensive but faster server nics for your back end didn't you,
and you do have a separate local lan interconnecting the machines, right?)
are where the mpi i/o occurs. The only way I have found to get the right
number of processes on the machines and using the right interconnects is
with a "process group file" where I can reference the interconnect - see the
mpich doc. All these things make a huge difference for gigabit ethernet lan
performance. I currently get the following throughput on 3.2 GHz dual cpu
p4's connected as described above for factor ix const pressure (90906
atoms):

#proc   psec/day
1       114
2       182
4       291
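
The speedup and parallel efficiency implied by these numbers can be worked
out with a few lines of shell and awk. A minimal sketch; the function name
scaling is illustrative, and the first input line is assumed to be the
1-process run.

```shell
#!/bin/sh
# Speedup and parallel efficiency relative to the first (1-process) line.
# Input lines: "<nproc> <psec/day>", taken from the table above.
scaling() {
    awk 'NR == 1 { base = $2 }
         { printf "%d %.2f %.1f%%\n", $1, $2 / base, 100 * $2 / (base * $1) }'
}

printf '1 114\n2 182\n4 291\n' | scaling
# prints:
# 1 1.00 100.0%
# 2 1.60 79.8%
# 4 2.55 63.8%
```

So 2 processes run at roughly 80% efficiency and 4 processes at roughly 64%.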

Note this is current in-development code, not pmemd 8. Basically you DON'T
get linear scaling on something like factor ix on these small systems with
gigabit ethernet because the distributed fft transposes are huge and
overwhelm the interconnect bandwidth. There is not nearly as much of a
problem for shared memory machines or real supercomputers (the 1 to 2
processor scaling drop is actually largely a cache sharing issue on these
small machines as you don't use the nic's; once you go to 4 procs, though,
you use the nics).
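
Returning to the "process group file" mentioned earlier: a sketch of
generating one is given here. The layout ("local N" followed by
"host nprocs program" lines) and the mpirun -p4pg option are recalled from
the mpich 1.2.x ch_p4 documentation and should be checked against the doc for
your version. The host name node2-gige, the program path, and the function
name make_procgroup are all hypothetical.

```shell
#!/bin/sh
# make_procgroup PROGRAM PROCS_PER_NODE REMOTE_HOST...
# Prints a ch_p4 style process group file to stdout.
# REMOTE_HOST should be the name that resolves to the interface on the
# dedicated interconnect (hypothetical names are used here).
make_procgroup() {
    prog=$1; ppn=$2; shift 2
    # "local N": N processes on the launching node in addition to the master
    echo "local $(( ppn - 1 ))"
    for host in "$@"; do
        echo "$host $ppn $prog"
    done
}

make_procgroup /path/to/pmemd 2 node2-gige
# prints:
# local 1
# node2-gige 2 /path/to/pmemd
```

The output would be saved to a file and passed to mpirun, e.g. something like
"mpirun -p4pg pgfile /path/to/pmemd ..." on an mpich 1.2.x installation.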

Okay, I may or may not have ever posted anything on this; I don't remember.
But if I didn't, the reason I didn't was because these are machine-specific
instructions that work with RedHat linux and probably a variety of other
linuxes (but probably not all), and that work with mpich. So you may have
to poke around for your specific machine. If you have a canned vendor
setup - like something from sgi or what have you, they probably get the base
config correct; the grief comes when you take a generic system and put your
own mpi(ch) on top of it. I have not looked at LAM, but there is no reason
it would not also be susceptible to the problem. This sort of thing
reflects a lack of deadlock avoidance software down there somewhere.

Sorry if this is not at all your problem; in my case though, this is the
source of rt benchmark hangs for sander 8 or pmemd 8.

Regards - Bob Duke
  ----- Original Message -----
  From: Imran Khan
  To: amber.scripps.edu
  Sent: Friday, October 14, 2005 12:37 PM
  Subject: Re: AMBER: AMBER goes in a Loop

  Hi David,

  Yes, all other benchmark tests, i.e. hb, jac, gb_alp etc., run
  successfully for 8 processors. Also, they run for 1, 2 and 4 processors.

  The problem is only with rt 8 processor run.

  Imran

  On 10/14/05, David A. Case <case.scripps.edu> wrote:
    On Thu, Oct 13, 2005, Imran Khan wrote:
    >
    > I am trying to run a benchmark called `rt` using Amber for 8 processors,
    > where it sort of goes into an infinite loop. However it runs successfully
    > to completion for less than 8 processors. This is for sander runs.

    Does the system work for other benchmarks, e.g. "jac" or "hb"? I'm trying
    to find out if the problem is specific to the "rt" test case. Also, does
    the system pass all the tests at 8 processors?

    ...dac

<== END EMBEDDED MESSAGE

----- Original Message -----
From: "Fabian Boes" <fabian.boes.itb.uni-stuttgart.de>
To: <amber.scripps.edu>
Sent: Thursday, July 06, 2006 4:29 AM
Subject: AMBER: Re: Strange problems with PMEMD on Intel Xeons with Infiniband

> Hi,
>
> just a quick notice that I was able to solve my problem with the PMEMD
> jobs hanging and producing no output files.
>
> The (maybe not so perfect) solution was to compile with
> -DSLOW_NONBLOCKING_MPI (see README file in src/pmemd). Now the jobs run
> fine but at the cost of approx. 20% speed loss.
>
> Unfortunately I can't provide details about the used Infiniband drivers
> and MPI versions as we do not operate that cluster ourselves.
>
> Bye,
>
> Fabian
>
> --
>
> Fabian Bös
>
> Institute of Technical Biochemistry
> University of Stuttgart / Germany
>
> Phone: +49-711-685-65156
> Fax: +49-711-685-63196
> Email: fabian.boes.itb.uni-stuttgart.de
>
> http://www.itb.uni-stuttgart.de
>
> -----------------------------------------------------------------------
> The AMBER Mail Reflector
> To post, send mail to amber.scripps.edu
> To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
>

Received on Sun Jul 09 2006 - 06:07:13 PDT