Re: AMBER: PMEMD with big systems

From: Florian Barth <bio_hazard.gmx.de>
Date: Wed, 23 Mar 2005 21:48:10 +0100


Hi Bob,

thanks for your reply. Actually, I only had 'ulimit -S -s unlimited' in
my .bash_profile. I added it to .bashrc as well, and now PMEMD runs
fine as long as I use mpirun. I had always double-checked the stack size
setting before running PMEMD, so I thought it was unnecessary to have the
command twice in my startup files. Well, I guess I don't need to
understand everything as long as it works. Sorry that I missed the
obvious.
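For reference, both startup files now contain the same line; my guess
(not something I have verified) is that the non-interactive shells
mpirun starts on the remote nodes only read .bashrc, so having the
command only in .bash_profile was not enough:

    # in both ~/.bash_profile and ~/.bashrc, so that login shells and the
    # non-interactive shells started on the nodes by mpirun both see it
    ulimit -S -s unlimited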
Unfortunately, the ulimit/limit settings in the startup files do not
seem to be picked up by mpiexec (the mpirun replacement for PBS); with
it, the described problem still exists. I will contact the author of
mpiexec and ask whether he can provide a workaround.
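One thing I may try in the meantime (just a sketch, not tested yet) is
launching PMEMD through a small wrapper script, so that the limit is
raised in the shell that actually execs pmemd, independently of any
startup files:

    #!/bin/sh
    # hypothetical wrapper (e.g. run_pmemd.sh) started by mpiexec instead
    # of pmemd itself; raises the soft stack limit, then execs pmemd
    ulimit -S -s unlimited
    exec /path/to/pmemd "$@"

mpiexec would then start the wrapper with the usual pmemd arguments.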

By the way, I'm using statically linked binaries now, which seem to run
fine on my Linux installation (they pass the tests, and no segfaults on
the systems tested so far). I had an issue with different malloc code
being present in both libc and libmpich(gm), which actually prevented
static linking, but Myricom provided me a small patch for this. I also
had to try several libc versions before everything worked OK.
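In case it helps anyone else, the quickest way I know to confirm that a
binary really came out statically linked is with the standard tools
(nothing AMBER-specific here):

    # a statically linked pmemd should be reported as such
    file pmemd     # expect "... statically linked ..."
    ldd pmemd      # expect "not a dynamic executable"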

Thanks again

        Florian

Robert Duke wrote:
| Florian -
| Okay, not to ask 'obvious' questions, but I presume you have 'limit
| stacksize unlimited' in BOTH your .login and .cshrc? For reasons that
| elude me (think system 'features') this is sometimes necessary. Now,
| that may not be it at all. There are a couple of other possible
| issues. There could be mixed shared libraries for ifort on some of the
| nodes; I doubt this but it is one installation issue, and I don't know
| what is done about versioning these libraries. Finally, a likely issue
| is how myrinet s/w is built. I have seen instances where you can run on
| 2 nodes (shared memory) on myrinet but you segfault if you try more,
| which brings the interface into action. Now, I don't really know what
| the root of this evil is, but your observation that you can run if you
| spread the job out enough is interesting. I am wondering if there are
| problems with the thread code in the myrinet software. Here's the
| scenario: If myrinet s/w was built with static linkage to the threads
| libraries, then static threads code would be used, and for at least some
| recent linux releases like redhat 3, these static libraries are known to
| have problems with small thread stacks (think seg fault). Why more
| problems in pmemd than sander? Well, pmemd uses asynch net i/o, which
| requires the use of threads; sander doesn't. This is all guesswork of
| course. I would be sure to get whoever is in charge of supporting your
| myrinet installation to test the stuff with async i/o and a test suite
| that sends around big chunks of data, and would check into exactly how
| it was built (use dynamic system libraries, not static - which implies
| dynamic ifort libraries too). Hope this helps, but I am just guessing.
| I have seen this stuff working on myrinet/opterons/pmemd 8 a couple of
| weeks ago, factor ix and jac, between 2 and 64 procs (2, 4, 8, 16, 24,
| 32, 40, 48, 56, 64 for factor_ix and jac), several thousand steps, not
| the sign of a problem (but pathscale compiler). It also used to work
| fine here at UNC before our myrinet h/w gave out (that was with earlier
| versions of the fortran compiler).
| If I can be of further help, please let me know, and if you all figure
| it out, please let everyone know also.
| Regards - Bob Duke
|
| ----- Original Message ----- From: "Florian Barth" <bio_hazard.gmx.de>
| To: <amber.scripps.edu>
| Sent: Friday, March 18, 2005 6:25 PM
| Subject: AMBER: PMEMD with big systems
|
|
| Hi,
|
| I have some trouble running PMEMD (amber8) with big systems like hb or
| factor_ix. For parallel runs, a certain minimum number of cpus is needed
| to run those systems; otherwise PMEMD will segfault. For example, PMEMD
| with factor_ix needs a minimum of 14 cpus to run (about 6,500 atoms/cpu).
| PMEMD is running on a linux cluster (gentoo linux kernel 2.4.28), with
| dual athlon-mp nodes and myrinet interconnect. I used the intel ifort
| 8.1.024 compiler with the new_configure utility for compilation. Myrinet
| software is gm version 1.6.5 (compiled with gcc) and mpich-gm 1.2.6..14a
| (gcc/ifort).
| Serial PMEMD and serial/parallel sander run without problems with
| factor_ix. Stack size is unlimited and shmem is set to 1 GB on all nodes.
| I was able to run factor_ix on 2 cpus with PMEMD during the installation
| phase of one of the nodes. But after some reboots the above limitation
| came up; unfortunately I have no idea what could have changed.
|
| Any hint would be greatly appreciated.
|
| Florian Barth
|
| ____________________________________
|
| Florian Barth
| Institute of Technical Biochemistry
| University of Stuttgart
| Allmandring 31
| 70569 Stuttgart
| Germany
| tel.:+49-711-6853811
| fax.:+49-711-6853196
| email:bio_hazard.gmx.de
| http://www.itb.uni-stuttgart.de


