Re: AMBER: pmemd segmentation fault

From: Robert Duke <rduke.email.unc.edu>
Date: Mon, 26 Mar 2007 09:25:36 -0400

Hi Vlad,
I probably need more information about both the computer system and the system
you are simulating. How big is the simulation system? Can you run sander or
pmemd on some other, smaller system? So far, all segment violations in pmemd
have been traced to insufficient stack size, but the message here indicates
that the hard resource limit is pretty high (bottom line - this sort of thing
typically occurs when the reciprocal force routines run and push a bunch of
stuff on the stack; the more processors you use, the smaller the problem
should be, and there is always the possibility of a previously unseen bug).
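If insufficient stack is indeed the culprit, the usual fix is to raise the
soft limit in the shell that launches the job. A minimal sketch (bash syntax;
for MPI jobs the limit generally has to be raised on every node, e.g. in your
shell startup file or batch script, since the launcher may not propagate it):

    ulimit -Ss               # show current soft stack limit (kB)
    ulimit -Hs               # show the hard limit the warning refers to
    ulimit -s unlimited      # raise the soft limit for this shell and children
    # csh/tcsh equivalent:  limit stacksize unlimited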
Okay, let's talk about 512 processors. Unless your problem is really huge -
over 1,000,000 atoms, say - I can't imagine you can effectively use all 512
processors. The pmemd code gets good performance via a two-pronged approach:
1) first we maximize single-processor performance, and 2) then we do whatever
we can to parallelize well. Currently, due to limitations of the slab-based
FFT workload division, you are generally best off somewhere below 512
processors (you will get throughput as good as some competing codes that
scale better, but on fewer processors - and ultimately what you should care
about is nsec/day throughput). Is there anything strange about the
hardware/software you are using? Is it something I directly support? Is it an
SGI Altix (where most of the stack problems seem to occur, I would guess due
to some default stack limit settings)? Bottom line - I need a lot more info
if you actually want help.
With sander, the stack problem is not as big a pain because sander does not
use nearly as much stack-based allocation (I do it in pmemd because it gives
slightly better performance due to page reuse - it is also a very nice
programming model). Sander 8, when compiled in default mode, only runs on a
power-of-two processor count; there is a #define that can override this, but
the resultant code is probably a bit slower (the define is noBTREE). I think
sander 9 does not require the define; it simply uses the power-of-2
algorithms if you have a power-of-2 CPU count. Oh, but you hit the 128-CPU
limit - the define to bump that up is MPI_MAX_PROCESSORS in parallel.h of
sander 8 (see the sketch below). It is actually a pretty bad idea to try to
run sander on more than 128 processors, though.
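For reference, the edit involved is tiny - a hypothetical sketch (the exact
declaration and file layout vary between Amber versions, so check your own
source tree):

    cd $AMBERHOME/src/sander
    # in parallel.h, change something like
    #     parameter (MPI_MAX_PROCESSORS = 128)
    # to the count you need, e.g.
    #     parameter (MPI_MAX_PROCESSORS = 256)
    make clean && make       # rebuild sander against the new cap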
Two other notes on pmemd:
1) To rule out problems with your specific simulation system, try running
the factor ix benchmark - say for 5000 steps on 128-256 CPUs - on your
system. If this works, then you know the problem is something about your
simulation system; if it doesn't, then it is something about your hardware,
or possibly a bug in the compiler used to build pmemd (since factor ix is
run all over the world at all sorts of processor counts, a correctly built
pmemd on a good hardware setup is known to work).
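A sketch of such a run (paths and launcher syntax are illustrative - adjust
them to your installation and queue system):

    cd $AMBERHOME/benchmarks/factor_ix     # location may differ in your tree
    # edit mdin so that nstlim = 5000
    mpirun -np 128 $AMBERHOME/exe/pmemd -O -i mdin -o mdout \
        -p prmtop -c inpcrd -r restrt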
2) To get better debugging info, try running your simulation system with a
version of pmemd built with F90_OPT_DFLT = $(F90_OPT_DBG) in config.h.
Expect this to be really, really slow; you have just disabled all
optimizations. There may be other environment variables you need to set to
get more debug info, depending on your compiler.
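A sketch of that debug rebuild (details vary by platform and Amber version):

    cd $AMBERHOME/src/pmemd
    ./configure ...          # rerun configure for your platform as before
    # then edit config.h so that:
    #     F90_OPT_DFLT = $(F90_OPT_DBG)
    make clean && make       # produces an unoptimized pmemd binary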
Regards - Bob Duke

----- Original Message -----
From: "Vlad Cojocaru" <Vlad.Cojocaru.eml-r.villa-bosch.de>
To: "AMBER list" <amber.scripps.edu>
Sent: Monday, March 26, 2007 5:14 AM
Subject: AMBER: pmemd segmentation fault


> Dear Amber users,
>
> I am trying to set up some Amber runs on a large cluster, so I switched
> from sander (AMBER 8) to pmemd (AMBER 9) and ran it on 512 processors.
> The job runs for 400 (out of 1,000,000) steps and is then interrupted
> with the error below. In the output I get the following warning:
> "WARNING: Stack usage limited by a hard resource limit of 4294967295
> bytes! If segment violations occur, get your sysadmin to increase the
> limit." Could anyone advise me how to deal with this? I should also tell
> you that the same job runs fine using sander (AMBER 8) on 32 processors
> or on 4 CPUs.
>
> And a second question ... when I tried sander (AMBER 8) on 256 CPUs, the
> job exits with the error "The number of processors must be a power of 2
> and no greater than 128, but is 256". Is 128 CPUs the upper limit for
> sander in AMBER 8? Does sander in AMBER 9 have the same limit?
>
> Thanks in advance
>
> Best wishes
> Vlad
>
>
>
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> Image     PC                Routine   Line      Source
> pmemd     4000000000067010  Unknown   Unknown   Unknown
> pmemd     400000000002D8C0  Unknown   Unknown   Unknown
> pmemd     4000000000052F10  Unknown   Unknown   Unknown
> pmemd     40000000000775B0  Unknown   Unknown   Unknown
> pmemd     40000000000B8730  Unknown   Unknown   Unknown
> pmemd     40000000000049D0  Unknown   Unknown   Unknown
> Unknown   20000000005913F0  Unknown   Unknown   Unknown
> pmemd     4000000000004400  Unknown   Unknown   Unknown
>
> Stack trace terminated abnormally.
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> Image     PC                Routine   Line      Source
> pmemd     40000000000625A0  Unknown   Unknown   Unknown
> pmemd     400000000002DA60  Unknown   Unknown   Unknown
> pmemd     4000000000052F10  Unknown   Unknown   Unknown
> pmemd     40000000000775B0  Unknown   Unknown   Unknown
> pmemd     40000000000B8730  Unknown   Unknown   Unknown
> pmemd     40000000000049D0  Unknown   Unknown   Unknown
> Unknown   20000000005913F0  Unknown   Unknown   Unknown
> pmemd     4000000000004400  Unknown   Unknown   Unknown
>
> Stack trace terminated abnormally.
>
> --
> ----------------------------------------------------------------------------
> Dr. Vlad Cojocaru
>
> EML Research gGmbH
> Schloss-Wolfsbrunnenweg 33
> 69118 Heidelberg
>
> Tel: ++49-6221-533266
> Fax: ++49-6221-533298
>
> e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de
>
> http://projects.villa-bosch.de/mcm/people/cojocaru/
>
> ----------------------------------------------------------------------------
> EML Research gGmbH
> Amtsgericht Mannheim / HRB 337446
> Managing Partner: Dr. h.c. Klaus Tschira
> Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter
> http://www.eml-r.org
> ----------------------------------------------------------------------------
>
>
> -----------------------------------------------------------------------
> The AMBER Mail Reflector
> To post, send mail to amber.scripps.edu
> To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
>


-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu