Re: AMBER: pmemd segmentation fault

From: Robert Duke <rduke.email.unc.edu>
Date: Mon, 26 Mar 2007 11:26:22 -0400

Hi Vlad -
My guess is that there is a problem with the pmemd installation on the big
cluster.

Also note that even if they give you better priority at 256+ processors, you
are just wasting your compute time if you don't use them efficiently. On the
best hardware I would not run a system like this on more than about 256
processors if I cared about consuming my allocation, and you will get really
good efficiency and reasonable throughput at 128 processors. If this is not
a high-performance InfiniBand cluster, chances are that running on 128
processors may not be that efficient (there is nothing we can do about a
relatively slow interconnect).

I don't know for sure what you mean by i8 vs. i4 versions, but I presume you
are referring to using 64-bit addresses vs. 32-bit addresses (the size of
the default integer, 4 bytes, should in no case be changed). There is rarely
a good reason to use 64-bit versions of the code, though that is what you
get in some instances.

You need to confirm that the site is not screwing up the pmemd
configuration. Is anybody else there successfully using the pmemd install?
A Red Hat 7.2 OS is really old too; there may be all sorts of
incompatibility issues with newer compilers (if you have ifort compilers,
you would definitely need to crosscheck compatibility). All kinds of really
bad stuff happened in the Red Hat OS lineage with regard to how threads were
handled, and incompatibilities between this stuff and the compilers created
a ton of grief around the timeframe of Amber 8. I basically can't do much
about the OS and compiler vendors screwing up everything in sight, other
than suggesting that you check compatibility (read the compiler release
notes) and get these guys to move forward.

My three prime suggestions:
0) Try the i4 version of the code; assuming they did an i8 default integer
compile, I would expect a ton of grief (I do a bunch of bit operations on
4-byte integers that might not work so well on a default 8-byte integer).
1) Check out factor ix on 128 procs; if it does not run, either the site hw
or sw installation has a problem.
2) Check up on how this stuff was built. I actually don't support Red Hat
7.2 anymore - heck, I was running it five years ago, and the threads model
got completely changed in the interim.

Why do threads matter? I don't use them directly, but I do tons of
asynchronous MPI i/o, and asynch MPI uses them. There could be all kinds of
OS/compiler incompatibility issues causing grief (these showed up as
unpredictable seg faults - generally in the first few hundred cycles - when
Amber 8 was first released). Also make sure these guys are using dynamically
linked libraries in the build - the big problems with thread stacks were in
the static libraries.

I am working with vague recollections here; hopefully you will be able to
work with the systems folks there to turn up the real problem.
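
To make the i4/i8 concern concrete, here is a minimal Python sketch. It is an
illustration only, not taken from the pmemd source: the function names are
hypothetical, and it simply shows two generic ways in which code written for
4-byte integers can misbehave when the integer width becomes 8 bytes. The
first reads a buffer of 4-byte integers as if it held 8-byte integers; the
second shows that a circular shift gives a different answer at each width.

```python
import struct


def reinterpret_i4_as_i8(values):
    """Pack values as little-endian 4-byte ints, then read the same bytes as 8-byte ints."""
    if len(values) % 2:
        raise ValueError("need an even number of 4-byte integers")
    raw = struct.pack(f"<{len(values)}i", *values)
    return list(struct.unpack(f"<{len(values) // 2}q", raw))


def rotate_left(word, count, width):
    """Circular left shift of an unsigned word of the given bit width."""
    mask = (1 << width) - 1
    word &= mask
    count %= width
    return ((word << count) | (word >> (width - count))) & mask


if __name__ == "__main__":
    # Four 4-byte integers become two unrelated 8-byte values.
    print(reinterpret_i4_as_i8([1, 2, 3, 4]))
    # The same bit pattern rotated by one position at 32 and at 64 bits.
    print(hex(rotate_left(0x80000001, 1, 32)), hex(rotate_left(0x80000001, 1, 64)))
```

At 32 bits the top bit wraps around to the bottom (result 0x3); at 64 bits it
does not (result 0x100000002), so any logic that relies on the wrap breaks.
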
Regards - Bob

----- Original Message -----
From: "Vlad Cojocaru" <Vlad.Cojocaru.eml-r.villa-bosch.de>
To: <amber.scripps.edu>
Sent: Monday, March 26, 2007 10:29 AM
Subject: Re: AMBER: pmemd segmentation fault


> Dear Robert,
>
> Thanks a lot for your reply. In fact, my starting simulation system is
> relatively small (about 65,000 atoms). I did some benchmarks on my local
> system using 4 CPUs, and pmemd9 was indeed the fastest program compared to
> sander8, sander9, and pmemd8.
>
> After this I got some computer time at the bigger computer facility, and
> I am using it to run lots of different, rather long simulations of this
> system to start with, before going to bigger systems by attaching other
> components to the starting system. The queue there is set up so that jobs
> using more than 256 processors get higher priority, and I also have a
> limited amount of computer time, so I am trying to be very efficient and
> as fast as possible. So I figured that running pmemd9 with 512 procs would
> get my jobs finished pretty fast. Now, I know for sure that the simulated
> system is absolutely fine, because it runs OK with sander9 on 32 procs and
> 128 procs, as well as on 4 procs on my local system. The problem has to be
> somewhere else. The cluster is a Linux cluster with 980 nodes (1960
> procs) running Red Hat 7.2. I don't have details about the Amber
> compilation, as they are not posted. I know they have i8 and i4 versions;
> however, I haven't yet managed to study what the difference between them
> is (I am using the i8 version).
>
> Best wishes
> vlad
>
>
>
>
>
> Robert Duke wrote:
>
>> Hi Vlad,
>> I probably need more info about both the computer system and the system
>> you are simulating. How big is the simulation system? Can you run it
>> with sander or pmemd on some other smaller system? So far, all segment
>> violations on pmemd have been tracked to insufficient stacksize, but the
>> message here indicates that the hard resource limit is pretty high
>> (bottom line - this sort of thing typically occurs when the reciprocal
>> force routines run and push a bunch of stuff on the stack - thing is, the
>> more processors you use, the less the problem should be, and there is
>> always the possibility of a previously unseen bug). Okay, let's talk
>> about 512 processors. Unless your problem is really huge - over
>> 1,000,000 atoms say, I can't imagine you can effectively use all 512
>> processors. The pmemd code gets good performance via a two-pronged
>> approach: 1) first we maximize the single processor performance, and 2)
>> then we do whatever we can to parallelize well. Currently, due to
>> limitations of slab-based fft workload division, you generally are best
>> off somewhere below 512 processors (you will get throughput as good as
>> some of the competing systems that scale better, but on fewer
>> processors - and ultimately what you should care about is nsec/day
>> throughput). Anything strange about the hardware/software you are using?
>> Is it something I directly support? Is it an sgi altix (where most of
>> the stack problems seem to occur, I would guess due to some default stack
>> limits settings)? Bottom line - I need a lot more info if you actually
>> want help.
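
The nsec/day point above can be put in numbers. The following is a minimal
Python sketch, assuming you have measured throughput at a few processor
counts. The function name scaling_table and the sample figures are
hypothetical; they are not pmemd benchmark results.

```python
def scaling_table(benchmarks, baseline_procs):
    """Summarize scaling; benchmarks maps processor count -> measured nsec/day."""
    base_rate = benchmarks[baseline_procs]
    rows = []
    for procs in sorted(benchmarks):
        rate = benchmarks[procs]
        speedup = rate / base_rate
        # Efficiency relative to the baseline run (1.0 means perfect scaling).
        efficiency = speedup * baseline_procs / procs
        # Allocation consumed to simulate one nanosecond.
        cpu_hours_per_nsec = procs * 24.0 / rate
        rows.append((procs, rate, efficiency, cpu_hours_per_nsec))
    return rows


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    made_up = {32: 4.0, 64: 7.6, 128: 13.6, 256: 20.0, 512: 22.0}
    for procs, rate, eff, cost in scaling_table(made_up, 32):
        print(f"{procs:4d} procs  {rate:5.1f} nsec/day  "
              f"efficiency {eff:4.0%}  {cost:6.1f} CPU-hours/nsec")
```

With these made-up figures, doubling from 256 to 512 processors raises
throughput only from 20 to 22 nsec/day, while the cost per nanosecond goes
from about 307 to about 559 CPU-hours.
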
>> On sander, the stack problem is not as big a pain because sander does not
>> use nearly as much stack-based allocation (I do it in pmemd because it
>> gives slightly better performance due to page reuse - it is also a very
>> nice programming model). Sander 8, when compiled in default mode, only
>> runs on a power of two processor count; there is a #define that can
>> override this; the resultant code is probably a bit slower (the define is
>> noBTREE). I think sander 9 does not require the define; it just uses the
>> power of 2 algorithms if you have a power of 2 cpu count. Oh, but you
>> hit the 128 cpu limit - the define to bump that up is MPI_MAX_PROCESSORS
>> in parallel.h of sander 8. It is actually a pretty bad idea to try to
>> run sander on more than 128 processors though.
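
As a rough sketch of the kind of check behind that error message (this is
not the sander source; the function and parameter names are hypothetical),
the defaults below mirror the quoted message: a power of 2 and no greater
than 128. Passing require_power_of_two=False loosely corresponds to a
noBTREE build, and a larger max_procs to a raised MPI_MAX_PROCESSORS.

```python
def is_power_of_two(n):
    """True when n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def check_proc_count(nprocs, max_procs=128, require_power_of_two=True):
    """Return None if nprocs is acceptable, otherwise a description of the problem."""
    if nprocs > max_procs:
        return f"{nprocs} processors exceeds the compiled-in limit of {max_procs}"
    if require_power_of_two and not is_power_of_two(nprocs):
        return f"{nprocs} processors is not a power of 2"
    return None


if __name__ == "__main__":
    for n in (4, 32, 96, 128, 256):
        print(n, check_proc_count(n) or "ok")
```
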
>> Two other notes on pmemd:
>> 1) to rule out problems with your specific simulation system, try running
>> the factor ix benchmark - say for 5000 steps, 128-256 cpu's, on your
>> system. If this works, then you know it is something about your
>> simulation system; if it doesn't, then it is something about your
>> hardware or possibly a compiler bug for the compiler used to build pmemd
>> (since factor ix is run all over the world at all sorts of processor
>> counts, correctly built pmemd on a good hardware setup is known to work).
>> 2) to get better debugging info, try running your simulation system on a
>> version of pmemd built with:
>> F90_OPT_DFLT = $(F90_OPT_DBG) in the config.h. Expect this to be really
>> really slow; you just disabled all optimizations. There may be other
>> environment variables you need to set to get more debug info, depending
>> on your compiler.
>> Regards - Bob Duke
>>
>> ----- Original Message -----
>> From: "Vlad Cojocaru" <Vlad.Cojocaru.eml-r.villa-bosch.de>
>> To: "AMBER list" <amber.scripps.edu>
>> Sent: Monday, March 26, 2007 5:14 AM
>> Subject: AMBER: pmemd segmentation fault
>>
>>
>>> Dear Amber users,
>>>
>>> I am trying to set up some Amber runs on a large cluster. So, I switched
>>> from sander (AMBER 8) to pmemd (AMBER 9) and I ran it on 512 processors.
>>> The job runs for 400 (out of 1,000,000) steps and then it is interrupted
>>> with the error below. In the output I get the following warning:
>>> "WARNING: Stack usage limited by a hard resource limit of 4294967295
>>> bytes! If segment violations occur, get your sysadmin to increase the
>>> limit.". Could anyone advise me how to deal with this? I should also
>>> tell you that the same job runs fine using sander (AMBER 8) on 32
>>> processors or 4 CPUs.
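
The limit quoted in that warning, 4294967295 bytes, equals 2^32 - 1, so it
may simply be how a maximal or unlimited setting is reported on that system
(an assumption; the warning itself does not say so). A minimal Python sketch
for inspecting the stack limits a job actually sees, using the standard
resource module (Unix only); the helper names are hypothetical:

```python
import resource


def describe(limit):
    """Render an rlimit value, treating RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit} bytes"


def stack_limits():
    """Return the (soft, hard) stack size limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_STACK)


def raise_soft_stack_limit():
    """Raise the soft stack limit to the hard limit; child processes inherit it."""
    soft, hard = stack_limits()
    if soft != hard:
        resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
    return stack_limits()


if __name__ == "__main__":
    soft, hard = stack_limits()
    print("soft stack limit:", describe(soft))
    print("hard stack limit:", describe(hard))
```

Running this inside the batch job, rather than on a login node, shows the
limits that the compute nodes apply, which may differ.
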
>>>
>>> And a second question ... when I tried sander (AMBER 8) on 256 CPUs, the
>>> job exits with an error "The number of processors must be a power of 2
>>> and no greater than 128 , but is 256". Is 128 CPUs the upper limit for
>>> sander in AMBER 8? Does sander in AMBER 9 have the same limit?
>>>
>>> Thanks in advance
>>>
>>> Best wishes
>>> Vlad
>>>
>>>
>>>
>>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>>> Image    PC                Routine  Line     Source
>>> pmemd    4000000000067010  Unknown  Unknown  Unknown
>>> pmemd    400000000002D8C0  Unknown  Unknown  Unknown
>>> pmemd    4000000000052F10  Unknown  Unknown  Unknown
>>> pmemd    40000000000775B0  Unknown  Unknown  Unknown
>>> pmemd    40000000000B8730  Unknown  Unknown  Unknown
>>> pmemd    40000000000049D0  Unknown  Unknown  Unknown
>>> Unknown  20000000005913F0  Unknown  Unknown  Unknown
>>> pmemd    4000000000004400  Unknown  Unknown  Unknown
>>>
>>> Stack trace terminated abnormally.
>>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>>> Image    PC                Routine  Line     Source
>>> pmemd    40000000000625A0  Unknown  Unknown  Unknown
>>> pmemd    400000000002DA60  Unknown  Unknown  Unknown
>>> pmemd    4000000000052F10  Unknown  Unknown  Unknown
>>> pmemd    40000000000775B0  Unknown  Unknown  Unknown
>>> pmemd    40000000000B8730  Unknown  Unknown  Unknown
>>> pmemd    40000000000049D0  Unknown  Unknown  Unknown
>>> Unknown  20000000005913F0  Unknown  Unknown  Unknown
>>> pmemd    4000000000004400  Unknown  Unknown  Unknown
>>>
>>> Stack trace terminated abnormally.
>>>
>>> --
>>> ----------------------------------------------------------------------------
>>>
>>> Dr. Vlad Cojocaru
>>>
>>> EML Research gGmbH
>>> Schloss-Wolfsbrunnenweg 33
>>> 69118 Heidelberg
>>>
>>> Tel: ++49-6221-533266
>>> Fax: ++49-6221-533298
>>>
>>> e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de
>>>
>>> http://projects.villa-bosch.de/mcm/people/cojocaru/
>>>
>>> ----------------------------------------------------------------------------
>>>
>>> EML Research gGmbH
>>> Amtgericht Mannheim / HRB 337446
>>> Managing Partner: Dr. h.c. Klaus Tschira
>>> Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter
>>> http://www.eml-r.org
>>> ----------------------------------------------------------------------------
>>>
>>>
>>>
>>> -----------------------------------------------------------------------
>>> The AMBER Mail Reflector
>>> To post, send mail to amber.scripps.edu
>>> To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
>>>
>


Received on Wed Mar 28 2007 - 06:07:26 PDT