Re: AMBER: pmemd segmentation fault

From: Vlad Cojocaru <Vlad.Cojocaru.eml-r.villa-bosch.de>
Date: Mon, 26 Mar 2007 18:43:52 +0200

Hi Bob,

Many thanks for all this info. It will help a lot when I ask the
people responsible at the facility, which is the PNNL Molecular
Science Computing Facility (http://mscf.emsl.pnl.gov/about/). Does
anybody on the amber list have experience with running amber at this
facility? I've just submitted a trial job using the i4 version of
pmemd on 256 CPUs (the 256-CPU trial job with the i8 version failed
with the same error as the 512-CPU one). Let's see what happens. If it
doesn't run, I will try to get in touch with the person who built
amber9 there and see if something can be done. I don't know anybody
who has managed to run pmemd there, but maybe I'll get some feedback
from the amber list. Well, if nothing can be done, I will just stick
with sander9 on 128 CPUs, which seems to run fine with a predicted
throughput of 2 ns per 21 hours (about 2.3 ns/day). That is much
slower than the pmemd benchmarks described in the Amber manual, but I
guess it's fine.


Best wishes
vlad



Robert Duke wrote:

> Hi Vlad -
> My guess would be that there may be a problem with the pmemd
> installation on the big cluster. Also note that even if they give you
> better priority at 256+ processors, you are just wasting your compute
> time if you don't use them efficiently. On the best hardware I would
> not run a system like this on more than about 256 processors if I
> cared about consuming my allocation, and you will get really good
> efficiency and reasonable throughput at 128 processors. If this is
> not a high-performance infiniband cluster, chances are that even
> running on 128 processors may not be that efficient (nothing we can
> do about a relatively slow interconnect).
>
> I don't know for sure what you mean by i8 vs. i4 versions, but I
> presume you are referring to using 64-bit vs. 32-bit addresses (the
> size of the default integer, 4 bytes, should in no case be changed).
> There is rarely a good reason to use 64-bit versions of the code,
> though that is what you get in some instances. You need to confirm
> that the site is not screwing up the pmemd configuration. Is anybody
> else there successfully using the pmemd install?
>
> A Red Hat 7.2 OS is really old too; there may be all sorts of
> incompatibility issues with newer compilers (if you have ifort
> compilers, you would definitely need to crosscheck compatibility).
> All kinds of really bad stuff happened in the Red Hat OS lineage with
> regard to how threads were handled, and incompatibilities between
> this stuff and the compilers created a ton of grief around the
> timeframe of amber 8. I basically can't do much about the OS and
> compiler vendors screwing up everything in sight other than
> suggesting that you check compatibility (read the compiler release
> notes) and get these guys to move forward.
>
> My three prime suggestions:
> 0) Try the i4 version of the code. Assuming they did an i8
> default-integer compile, I would expect a ton of grief (I do a bunch
> of bit operations on 4-byte integers that might not work so well on a
> default 8-byte integer).
> 1) Check out factor ix on 128 procs; if it does not run, either the
> site hw or sw installation has a problem.
> 2) Check up on how this stuff was built - I actually don't support
> redhat 7.2 anymore; heck, I was running it five years ago, and the
> threads model got completely changed in the interim.
>
> Why do threads matter? I don't use them directly, but I do tons of
> asynchronous mpi i/o, and asynch mpi uses threads. There could be all
> kinds of OS/compiler incompatibility issues causing grief (these
> showed up as unpredictable seg faults - generally in the first few
> hundred cycles - when amber 8 was first released). Also make sure
> these guys are using dynamically linked libraries in the build - the
> big problems with thread stacks were in the static libraries; see the
> sketch below. I am working with vague recollections here; hopefully
> you will be able to work with the systems folks there to turn up the
> real problem.
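> A quick way to act on suggestion 1) and on the dynamic-linking check
> might look like the following (the benchmark directory, input file
> names, and the mpirun launcher are my assumptions about the local
> setup; adjust for the site's MPI environment):
>
>   # is pmemd dynamically linked? (the thread-stack problems were in
>   # statically linked builds)
>   ldd $AMBERHOME/exe/pmemd
>
>   # run the factor ix benchmark on 128 procs
>   cd $AMBERHOME/benchmarks/factor_ix
>   mpirun -np 128 $AMBERHOME/exe/pmemd -O -i mdin -o mdout \
>       -p prmtop -c inpcrd -r restrt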
> Regards - Bob
>
> ----- Original Message ----- From: "Vlad Cojocaru"
> <Vlad.Cojocaru.eml-r.villa-bosch.de>
> To: <amber.scripps.edu>
> Sent: Monday, March 26, 2007 10:29 AM
> Subject: Re: AMBER: pmemd segmentation fault
>
>
>> Dear Robert,
>>
>> Thanks a lot for your reply. In fact, my starting simulation system
>> is relatively small (about 65,000 atoms). I did some benchmarks on
>> my local system using 4 CPUs, and indeed pmemd9 was the fastest
>> program compared to sander8, sander9, and pmemd8.
>>
>> After this I got some computer time at the bigger computer facility,
>> and I am using it to run lots of different, rather long simulations
>> of this system before going to bigger systems built by attaching
>> other components to the starting one. The queue there is set up so
>> that jobs using more than 256 processors get higher priority, and
>> since I also have a limited amount of computer time, I am trying to
>> be as efficient and fast as possible. So I figured that running
>> pmemd9 on 512 procs would get my jobs finished pretty fast. Now, I
>> know for sure that the simulated system is absolutely fine, because
>> it runs OK with sander9 on 32 and 128 procs, as well as on 4 procs
>> on my local system. The problem has to be somewhere else. The
>> cluster is a Linux cluster with 980 nodes (1960 procs) running Red
>> Hat 7.2. I don't have details about the Amber compilation, as they
>> are not posted. I know they have i8 and i4 versions, but I have not
>> yet managed to find out what the difference between them is (I am
>> using the i8 version).
>>
>> Best wishes
>> vlad
>>
>>
>>
>>
>>
>> Robert Duke wrote:
>>
>>> Hi Vlad,
>>> I probably need more info about both the computer system and the
>>> system you are simulating. How big is the simulation system? Can
>>> you run it with sander or pmemd on some other smaller system? So
>>> far, all segment violations on pmemd have been tracked to
>>> insufficient stacksize, but the message here indicates that the hard
>>> resource limit is pretty high (bottom line - this sort of thing
>>> typically occurs when the reciprocal force routines run and push a
>>> bunch of stuff on the stack - thing is, the more processors you use,
>>> the less the problem should be, and there is always the possibility
>>> of a previously unseen bug).
>>>
>>> Okay, let's talk about 512 processors.
>>> Unless your problem is really huge - over 1,000,000 atoms say, I
>>> can't imagine you can effectively use all 512 processors. The pmemd
>>> code gets good performance via a two-pronged approach: 1) first we
>>> maximize the single processor performance, and 2) then we do
>>> whatever we can to parallelize well. Currently, due to limitations
>>> of slab-based fft workload division, you generally are best off
>>> somewhere below 512 processors (you will get throughput as good as
>>> some of the competing systems that scale better, but on fewer
>>> processors - and ultimately what you should care about is nsec/day
>>> throughput).
>>>
>>> Anything strange about the hardware/software you are
>>> using? Is it something I directly support? Is it an sgi altix
>>> (where most of the stack problems seem to occur, I would guess due
>>> to some default stack limits settings)? Bottom line - I need a lot
>>> more info if you actually want help.
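>>> (For what it's worth, the first thing to look at is usually the
>>> shell's stack limit in the job script, before the mpirun line; the
>>> exact syntax depends on the shell, so treat this as a sketch:)
>>>
>>>   ulimit -s                  # show the current soft limit (sh/bash)
>>>   ulimit -s unlimited        # raise it, up to the hard limit
>>>   limit stacksize unlimited  # csh/tcsh equivalent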
>>> On sander, the stack problem is not as big a pain because sander
>>> does not use nearly as much stack-based allocation (I do it in pmemd
>>> because it gives slightly better performance due to page reuse - it
>>> is also a very nice programming model). Sander 8, when compiled in
>>> default mode, only runs on a power of two processor count; there is
>>> a #define that can override this; the resultant code is probably a
>>> bit slower (the define is noBTREE). I think sander 9 does not
>>> require the define; it just uses the power of 2 algorithms if you
>>> have a power of 2 cpu count. Oh, but you hit the 128 cpu limit -
>>> the define to bump that up is MPI_MAX_PROCESSORS in parallel.h of
>>> sander 8. It is actually a pretty bad idea to try to run sander on
>>> more than 128 processors though.
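>>> (To make those two defines concrete - the file path and the exact
>>> declaration text below are my assumptions about the sander 8 source
>>> tree, so check parallel.h before editing:)
>>>
>>>   # allow non-power-of-2 cpu counts (slower code): add -DnoBTREE
>>>   # to the Fortran preprocessor flags and rebuild.
>>>   # lift the 128-cpu cap, then rebuild:
>>>   sed -i 's/MPI_MAX_PROCESSORS = 128/MPI_MAX_PROCESSORS = 256/' \
>>>       $AMBERHOME/src/sander/parallel.h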
>>> Two other notes on pmemd:
>>> 1) to rule out problems with your specific simulation system, try
>>> running the factor ix benchmark - say for 5000 steps, 128-256 CPUs,
>>> on your system. If this works, then you know it is something about
>>> your simulation system; if it doesn't, then it is something about
>>> your hardware or possibly a compiler bug for the compiler used to
>>> build pmemd (since factor ix is run all over the world at all sorts
>>> of processor counts, correctly built pmemd on a good hardware setup
>>> is known to work).
>>> 2) to get better debugging info, try running your simulation system
>>> on a version of pmemd built with:
>>> F90_OPT_DFLT = $(F90_OPT_DBG) in the config.h. Expect this to be
>>> really really slow; you just disabled all optimizations. There may
>>> be other environment variables you need to set to get more debug
>>> info, depending on your compiler.
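>>> (Concretely, that is a one-line change in pmemd's config.h followed
>>> by a full rebuild; the make targets here are assumptions about the
>>> build setup:)
>>>
>>>   # in config.h: compile with debug flags instead of optimizations
>>>   F90_OPT_DFLT = $(F90_OPT_DBG)
>>>
>>>   make clean && make install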
>>> Regards - Bob Duke
>>>
>>> ----- Original Message ----- From: "Vlad Cojocaru"
>>> <Vlad.Cojocaru.eml-r.villa-bosch.de>
>>> To: "AMBER list" <amber.scripps.edu>
>>> Sent: Monday, March 26, 2007 5:14 AM
>>> Subject: AMBER: pmemd segmentation fault
>>>
>>>
>>>> Dear Amber users,
>>>>
>>>> I am trying to set up some Amber runs on a large cluster. So I
>>>> switched from sander (AMBER 8) to pmemd (AMBER 9) and ran it on
>>>> 512 processors. The job runs for 400 (out of 1,000,000) steps and
>>>> is then interrupted with the error below. In the output I get the
>>>> following warning: "WARNING: Stack usage limited by a hard
>>>> resource limit of 4294967295 bytes! If segment violations occur,
>>>> get your sysadmin to increase the limit." Could anyone advise me
>>>> how to deal with this? I should also tell you that the same job
>>>> runs fine using sander (AMBER 8) on 32 processors or on 4 CPUs.
>>>>
>>>> And a second question: when I tried sander (AMBER 8) on 256
>>>> CPUs, the job exited with the error "The number of processors must
>>>> be a power of 2 and no greater than 128 , but is 256". Is 128 CPUs
>>>> the upper limit for sander in AMBER 8? Does sander in AMBER 9 have
>>>> the same limit?
>>>>
>>>> Thanks in advance
>>>>
>>>> Best wishes
>>>> Vlad
>>>>
>>>>
>>>>
>>>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>>>> Image    PC                Routine   Line      Source
>>>> pmemd    4000000000067010  Unknown   Unknown   Unknown
>>>> pmemd    400000000002D8C0  Unknown   Unknown   Unknown
>>>> pmemd    4000000000052F10  Unknown   Unknown   Unknown
>>>> pmemd    40000000000775B0  Unknown   Unknown   Unknown
>>>> pmemd    40000000000B8730  Unknown   Unknown   Unknown
>>>> pmemd    40000000000049D0  Unknown   Unknown   Unknown
>>>> Unknown  20000000005913F0  Unknown   Unknown   Unknown
>>>> pmemd    4000000000004400  Unknown   Unknown   Unknown
>>>>
>>>> Stack trace terminated abnormally.
>>>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>>>> Image    PC                Routine   Line      Source
>>>> pmemd    40000000000625A0  Unknown   Unknown   Unknown
>>>> pmemd    400000000002DA60  Unknown   Unknown   Unknown
>>>> pmemd    4000000000052F10  Unknown   Unknown   Unknown
>>>> pmemd    40000000000775B0  Unknown   Unknown   Unknown
>>>> pmemd    40000000000B8730  Unknown   Unknown   Unknown
>>>> pmemd    40000000000049D0  Unknown   Unknown   Unknown
>>>> Unknown  20000000005913F0  Unknown   Unknown   Unknown
>>>> pmemd    4000000000004400  Unknown   Unknown   Unknown
>>>>
>>>> Stack trace terminated abnormally.
>>>>
>>
>

-- 
----------------------------------------------------------------------------
Dr. Vlad Cojocaru
EML Research gGmbH
Schloss-Wolfsbrunnenweg 33
69118 Heidelberg
Tel: ++49-6221-533266
Fax: ++49-6221-533298
e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de
http://projects.villa-bosch.de/mcm/people/cojocaru/
----------------------------------------------------------------------------
EML Research gGmbH
Amtgericht Mannheim / HRB 337446
Managing Partner: Dr. h.c. Klaus Tschira
Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter
http://www.eml-r.org
----------------------------------------------------------------------------
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu