Re: AMBER: pmemd segmentation fault

From: Vlad Cojocaru <Vlad.Cojocaru.eml-r.villa-bosch.de>
Date: Mon, 26 Mar 2007 16:29:12 +0200

Dear Robert,

Thanks a lot for your reply. In fact, my starting simulation system is
relatively small (about 65,000 atoms). I did some benchmarks on my local
system using 4 CPUs, and pmemd9 was indeed the fastest program compared
to sander8, sander9, and pmemd8.

So, after this I got some computer time at the bigger computer facility
and I am using it to run a number of different, rather long simulations
of this system before moving on to bigger systems built by attaching
other components to the starting one. The queue there is set up so that
jobs using more than 256 processors get higher priority, and since I
also have a limited amount of computer time I am trying to be as
efficient and as fast as possible. So I figured that running pmemd9 on
512 procs would get my jobs finished pretty fast. Now, I know for sure
that the simulated system itself is fine, because it runs OK with
sander9 on 32 procs and 128 procs, as well as on 4 procs on my local
system. The problem has to be somewhere else. The cluster is a Linux
cluster with 980 nodes (1960 procs) running Red Hat 7.2. I don't have
details about the Amber compilation, as they are not posted. I know
they have i8 and i4 versions, but I haven't yet looked into the
difference between them (I am using the i8 version).

Best wishes
vlad

Robert Duke wrote:

> Hi Vlad,
> I probably need more info about both the computer system and the
> system you are simulating. How big is the simulation system? Can you
> run it with sander or pmemd on some other smaller system? So far, all
> segment violations on pmemd have been tracked to insufficient
> stacksize, but the message here indicates that the hard resource limit
> is pretty high (bottom line - this sort of thing typically occurs when
> the reciprocal force routines run and push a bunch of stuff on the
> stack - thing is, the more processors you use, the less the problem
> should be, and there is always the possibility of a previously unseen
> bug). Okay, let's talk about 512 processors. Unless your problem is
> really huge - over 1,000,000 atoms, say - I can't imagine you can
> effectively use all 512 processors. The pmemd code gets good
> performance via a two-pronged approach: 1) first we maximize the
> single processor performance, and 2) then we do whatever we can to
> parallelize well. Currently, due to limitations of slab-based fft
> workload division, you generally are best off somewhere below 512
> processors (you will get throughput as good as some of the competing
> systems that scale better, but on fewer processors - and ultimately
> what you should care about is nsec/day throughput). Anything strange
> about the hardware/software you are using? Is it something I directly
> support? Is it an sgi altix (where most of the stack problems seem to
> occur, I would guess due to some default stack limits settings)?
> Bottom line - I need a lot more info if you actually want help.
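>
> As a quick way to see what limit your jobs actually run under, you can
> query the stack limits and raise the soft limit to the hard limit
> before launching pmemd. Below is a minimal sketch in C (assuming a
> Linux node; the usual job-script equivalent is "ulimit -s unlimited" -
> neither of these is part of pmemd itself):
>
>     /* check_stack.c - query the stack limits and raise the soft limit
>        to the hard limit (only the sysadmin can raise the hard limit
>        itself).  Run it in the same environment the MPI job sees, e.g.
>        through the queue system, to see what pmemd is actually given. */
>     #include <stdio.h>
>     #include <sys/resource.h>
>
>     int main(void)
>     {
>         struct rlimit rl;
>
>         if (getrlimit(RLIMIT_STACK, &rl) != 0) {
>             perror("getrlimit");
>             return 1;
>         }
>         printf("stack soft limit: %llu bytes, hard limit: %llu bytes\n",
>                (unsigned long long) rl.rlim_cur,
>                (unsigned long long) rl.rlim_max);
>
>         rl.rlim_cur = rl.rlim_max;  /* raise soft limit as far as allowed */
>         if (setrlimit(RLIMIT_STACK, &rl) != 0) {
>             perror("setrlimit");
>             return 1;
>         }
>         return 0;
>     }
>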
> On sander, the stack problem is not as big a pain because sander does
> not use nearly as much stack-based allocation (I do it in pmemd
> because it gives slightly better performance due to page reuse - it is
> also a very nice programming model). Sander 8, when compiled in
> default mode, only runs on a power of two processor count; there is a
> #define that can override this; the resultant code is probably a bit
> slower (the define is noBTREE). I think sander 9 does not require the
> define; it just uses the power of 2 algorithms if you have a power of
> 2 cpu count. Oh, but you hit the 128 cpu limit - the define to bump
> that up is MPI_MAX_PROCESSORS in parallel.h of sander 8. It is
> actually a pretty bad idea to try to run sander on more than 128
> processors though.
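>
> Just to illustrate the restriction (this is a hypothetical sketch, not
> the actual sander source), the check amounts to something like:
>
>     /* Default sander 8 restriction: the task count must be a power of
>        two and, unless MPI_MAX_PROCESSORS in parallel.h is raised, no
>        greater than 128.  Illustration only. */
>     #include <stdio.h>
>
>     static int is_power_of_two(unsigned int n)
>     {
>         return n != 0 && (n & (n - 1)) == 0;
>     }
>
>     int main(void)
>     {
>         const unsigned int max_procs = 128;  /* default MPI_MAX_PROCESSORS */
>         unsigned int numtasks = 256;         /* the failing case here */
>
>         if (!is_power_of_two(numtasks) || numtasks > max_procs)
>             printf("%u tasks rejected (must be a power of 2, <= %u)\n",
>                    numtasks, max_procs);
>         return 0;
>     }
>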
> Two other notes on pmemd:
> 1) to rule out problems with your specific simulation system, try
> running the factor ix benchmark - say for 5000 steps, 128-256 cpu's,
> on your system. If this works, then you know it is something about
> your simulation system; if it doesn't, then it is something about your
> hardware or possibly a compiler bug for the compiler used to build
> pmemd (since factor ix is run all over the world at all sorts of
> processor counts, correctly built pmemd on a good hardware setup is
> known to work).
> 2) to get better debugging info, try running your simulation system on
> a version of pmemd built with:
> F90_OPT_DFLT = $(F90_OPT_DBG) in the config.h. Expect this to be
> really really slow; you just disabled all optimizations. There may be
> other environment variables you need to set to get more debug info,
> depending on your compiler.
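>
> And since the bottom line above is nsec/day, a back-of-the-envelope
> conversion from a benchmark timing (placeholder numbers, not measured
> values):
>
>     /* ns/day from the time step and the wall-clock time per step. */
>     #include <stdio.h>
>
>     int main(void)
>     {
>         double dt_fs        = 2.0;   /* MD time step in fs (assumed) */
>         double sec_per_step = 0.05;  /* wall-clock seconds per step
>                                         (placeholder) */
>
>         double ns_per_day = dt_fs * 1.0e-6 * (86400.0 / sec_per_step);
>         printf("throughput: %.2f ns/day\n", ns_per_day);
>         return 0;
>     }
>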
> Regards - Bob Duke
>
> ----- Original Message ----- From: "Vlad Cojocaru"
> <Vlad.Cojocaru.eml-r.villa-bosch.de>
> To: "AMBER list" <amber.scripps.edu>
> Sent: Monday, March 26, 2007 5:14 AM
> Subject: AMBER: pmemd segmentation fault
>
>
>> Dear Amber users,
>>
>> I am trying to set up some Amber runs on a large cluster. So, I
>> switched from sander (AMBER 8) to pmemd (AMBER 9) and ran it on 512
>> processors. The job runs for 400 (out of 1,000,000) steps and is then
>> interrupted with the error below. In the output I get the following
>> warning: "WARNING: Stack usage limited by a hard resource limit of
>> 4294967295 bytes! If segment violations occur, get your sysadmin to
>> increase the limit." Could anyone advise me how to deal with this? I
>> should also tell you that the same job runs fine using sander (AMBER
>> 8) on 32 processors or 4 CPUs.
>>
>> And a second question ... when I tried sander (AMBER 8) on 256 CPUs,
>> the job exits with the error "The number of processors must be a power
>> of 2 and no greater than 128 , but is 256". Is 128 CPUs the upper
>> limit for sander in AMBER 8? Does sander in AMBER 9 have the same
>> limit?
>>
>> Thanks in advance
>>
>> Best wishes
>> Vlad
>>
>>
>>
>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>> Image              PC                Routine    Line       Source
>> pmemd              4000000000067010  Unknown    Unknown    Unknown
>> pmemd              400000000002D8C0  Unknown    Unknown    Unknown
>> pmemd              4000000000052F10  Unknown    Unknown    Unknown
>> pmemd              40000000000775B0  Unknown    Unknown    Unknown
>> pmemd              40000000000B8730  Unknown    Unknown    Unknown
>> pmemd              40000000000049D0  Unknown    Unknown    Unknown
>> Unknown            20000000005913F0  Unknown    Unknown    Unknown
>> pmemd              4000000000004400  Unknown    Unknown    Unknown
>>
>> Stack trace terminated abnormally.
>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>> Image              PC                Routine    Line       Source
>> pmemd              40000000000625A0  Unknown    Unknown    Unknown
>> pmemd              400000000002DA60  Unknown    Unknown    Unknown
>> pmemd              4000000000052F10  Unknown    Unknown    Unknown
>> pmemd              40000000000775B0  Unknown    Unknown    Unknown
>> pmemd              40000000000B8730  Unknown    Unknown    Unknown
>> pmemd              40000000000049D0  Unknown    Unknown    Unknown
>> Unknown            20000000005913F0  Unknown    Unknown    Unknown
>> pmemd              4000000000004400  Unknown    Unknown    Unknown
>>
>> Stack trace terminated abnormally.
>>
>

-- 
----------------------------------------------------------------------------
Dr. Vlad Cojocaru
EML Research gGmbH
Schloss-Wolfsbrunnenweg 33
69118 Heidelberg
Tel: ++49-6221-533266
Fax: ++49-6221-533298
e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de
http://projects.villa-bosch.de/mcm/people/cojocaru/
----------------------------------------------------------------------------
EML Research gGmbH
Amtgericht Mannheim / HRB 337446
Managing Partner: Dr. h.c. Klaus Tschira
Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter
http://www.eml-r.org
----------------------------------------------------------------------------
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
Received on Wed Mar 28 2007 - 06:07:26 PDT