Re: AMBER: pmemd segmentation fault from Robert Duke on 2007-03-26 (Amber Archive Mar 2007)

From: Robert Duke <rduke.email.unc.edu>
Date: Mon, 26 Mar 2007 13:16:08 -0400

Hi Vlad -
After sending the last mail, I am really drifting toward this likely being
an "i8" build problem; this is somebody there getting creative for no good
reason (it doesn't help with anything). There is no reason that a good
pmemd build should not be possible; it may just require getting in touch
with the right people. I would actually bet that i4 will work, unless they
indeed have some of the other stuff going on. It is worthwhile to get pmemd
working because at this sort of facility you can get upwards of 4 times as
much throughput at the higher processor counts.
Best - Bob

----- Original Message -----
From: "Vlad Cojocaru" <Vlad.Cojocaru.eml-r.villa-bosch.de>
To: <amber.scripps.edu>
Sent: Monday, March 26, 2007 12:43 PM
Subject: Re: AMBER: pmemd segmentation fault

> Hi Bob,
>
> Many thanks for all this info. It will help a lot when I ask the people
> responsible at the facility, which is actually the "PNNL Molecular Science
> Computing Facility" (http://mscf.emsl.pnl.gov/about/). Does anybody on the
> amber list have experience with running amber at this facility? I've just
> submitted a trial job using the i4 version (pmemd) on 256 CPUs (the 256-CPU
> trial job with the i8 version failed as well, with the same error as the
> 512-CPU one). Let's see what happens. If it doesn't run I will try to get
> in touch with the person who built amber9 there and see if something can
> be done. I don't know anybody who has managed to run pmemd there, but
> maybe I'll get some feedback from the amber list. If nothing can be done
> I will just stick to sander9 on 128 CPUs, which seems to run fine with a
> predicted output of 2 ns per 21 hours. That seems much slower than the
> pmemd benchmarks described in the amber manual, but I guess it's fine.
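To put the figures quoted above on a common footing, here is a minimal sketch that converts "2 ns per 21 hours on 128 CPUs" into nsec/day and into CPU-hours per nanosecond. The function names are made up for this illustration; only the three input numbers come from the thread.

```python
def ns_per_day(simulated_ns, wall_hours):
    """Throughput in nanoseconds of simulated time per wall-clock day."""
    return simulated_ns * 24.0 / wall_hours


def cpu_hours_per_ns(n_procs, simulated_ns, wall_hours):
    """Allocation cost: processor-hours charged per simulated nanosecond."""
    return n_procs * wall_hours / simulated_ns


# Numbers reported in the thread for sander9 on 128 CPUs.
rate = ns_per_day(2.0, 21.0)
cost = cpu_hours_per_ns(128, 2.0, 21.0)

print(f"throughput: {rate:.2f} ns/day")
print(f"cost: {cost:.0f} CPU-hours per ns")
```

The same two functions can be applied to any pmemd timing once a run succeeds, which makes it easy to compare throughput and allocation cost side by side.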
>
>
> Best wishes
> vlad
>
>
>
> Robert Duke wrote:
>
>> Hi Vlad -
>> My guess would be there may be a problem with the pmemd installation on
>> the big cluster. Also note, even if they give you better priority at
>> 256+ processors, if you don't use them efficiently, you are just wasting
>> your compute time. On the best hardware I would not run a system like
>> this on more than about 256 processors if I cared about consuming my
>> allocation, and you will get really good efficiency and reasonable
>> throughput at 128 processors. If this is not a high performance
>> infiniband cluster, chances are that running on 128 processors may not be
>> that efficient (nothing we can do about a relatively slow interconnect).
>>
>> I don't know what you mean by i8 vs. i4 versions for sure, but presume
>> you are referring to using 64 bit addresses vs. 32 bit addresses (the
>> size of the default integer, 4 bytes, should in no case be changed).
>> There is rarely a good reason to use 64 bit versions of the code, though
>> that is what you get in some instances. You need to confirm that the
>> site is not screwing up the pmemd configuration. Is anybody else there
>> successfully using the pmemd install?
>>
>> A redhat 7.2 OS is really old too; there may be all sorts of
>> incompatibility issues with newer compilers (if you have ifort compilers,
>> you would definitely need to crosscheck compatibility). All kinds of
>> really bad stuff happened in the RedHat OS lineage with regard to how
>> threads were handled, and incompatibilities between this stuff and the
>> compilers created a ton of grief around the timeframe of amber 8. I
>> basically can't do much about the OS and compiler vendors screwing up
>> everything in sight other than suggesting that you check compatibility
>> (read the compiler release notes) and get these guys to move forward.
>>
>> My three prime suggestions:
>> 0) try the i4 version of the code; assuming they did an i8 default
>>    integer compile, I would expect a ton of grief (I do a bunch of bit
>>    operations on 4 byte integers, that might not work so well on a
>>    default 8 byte integer),
>> 1) check out factor ix on 128 procs; if it does not run, either the site
>>    hw or sw installation has a problem, and
>> 2) check up on how this stuff was built - I actually don't support redhat
>>    7.2 anymore - heck I was running it five years ago, and the threads
>>    model got completely changed in the interim.
>>
>> Why do threads matter? I don't use them directly, but I do tons of
>> asynchronous mpi i/o, and asynch mpi uses them. There could be all kinds
>> of OS/compiler incompatibility issues causing grief (these showed up as
>> unpredictable seg faults - generally in the first few hundred cycles -
>> when amber 8 was first released). Also make sure these guys are using
>> dynamically linked libraries in the build - the big problems with thread
>> stacks were in the static libraries. I am working with vague
>> recollections here; hopefully you will be able to work with the systems
>> folks there to turn up the real problem.
>> Regards - Bob
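The worry in suggestion 0) about bit operations on 4 byte integers can be illustrated with a small sketch. This is not pmemd code (pmemd is Fortran and its bit manipulation is not shown in this thread); it only emulates, in Python, two generic ways that code written for 4 byte integers may misbehave when the default integer becomes 8 bytes. The helper names are assumptions made for the example.

```python
import struct


def as_int32(value):
    """Wrap a Python int to a signed 4 byte two's-complement value."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def as_int64(value):
    """Wrap a Python int to a signed 8 byte two's-complement value."""
    value &= 0xFFFFFFFFFFFFFFFF
    return value - 0x10000000000000000 if value & 0x8000000000000000 else value


# 1) Setting the top bit of a 4 byte word flips its sign, so code that
#    tests "value < 0" to detect that flag sees it. In an 8 byte word
#    the same bit is an ordinary bit and the value stays positive.
flag = 1 << 31
sign32 = as_int32(flag)
sign64 = as_int64(flag)
print("bit 31 set, 4 byte view:", sign32)
print("bit 31 set, 8 byte view:", sign64)

# 2) A buffer filled with 8 byte integers but read back as 4 byte
#    integers (little-endian here) shows padding words between values.
raw = struct.pack("<3q", 1, 2, 3)
misread = struct.unpack("<6i", raw)
print("three 8 byte ints read as 4 byte ints:", misread)
```

Either effect would be enough to corrupt packed flags or message buffers, which is consistent with (though not proof of) the unpredictable failures described in this thread.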
>>
>> ----- Original Message ----- From: "Vlad Cojocaru"
>> <Vlad.Cojocaru.eml-r.villa-bosch.de>
>> To: <amber.scripps.edu>
>> Sent: Monday, March 26, 2007 10:29 AM
>> Subject: Re: AMBER: pmemd segmentation fault
>>
>>
>>> Dear Robert,
>>>
>>> Thanks a lot for your reply. In fact, my starting simulation system is
>>> relatively small (about 65,000 atoms). I did some benchmarks on my local
>>> system using 4 CPUs, and indeed pmemd9 was the fastest program compared
>>> to sander8, sander9 and pmemd8.
>>>
>>> After this I got some computer time at the bigger computer facility,
>>> and I am using it to do lots of different, rather long simulations of
>>> this system to start with, before going to bigger systems by attaching
>>> other components to the starting system. The queue there is set up so
>>> that jobs using more than 256 processors get higher priority, and I
>>> also have a limited amount of computer time, so I am trying to be very
>>> efficient and as fast as possible. So I figured that running pmemd9
>>> with 512 procs would get my jobs finished pretty fast.
>>> Now, I know for sure that the simulated system is absolutely fine,
>>> because it runs OK with sander9 on 32 procs and 128 procs, as well as
>>> on 4 procs on my local system. The problem has to be somewhere else.
>>> The cluster is a Linux cluster with 980 nodes (1960 procs) running Red
>>> Hat 7.2. I don't have details about the Amber compilation, as they are
>>> not posted. I know they have i8 and i4 versions, but I haven't yet
>>> managed to find out what the difference between them is (I am using the
>>> i8 version).
>>>
>>> Best wishes
>>> vlad
>>>
>>>
>>>
>>>
>>>
>>> Robert Duke wrote:
>>>
>>>> Hi Vlad,
>>>> I probably need more info about both the computer system and the system
>>>> you are simulating. How big is the simulation system? Can you run it
>>>> with sander or pmemd on some other smaller system? So far, all segment
>>>> violations on pmemd have been tracked to insufficient stacksize, but
>>>> the message here indicates that the hard resource limit is pretty high
>>>> (bottom line - this sort of thing typically occurs when the reciprocal
>>>> force routines run and push a bunch of stuff on the stack - thing is,
>>>> the more processors you use, the less the problem should be, and there
>>>> is always the possibility of a previously unseen bug). Okay, let's talk
>>>> about 512 processors. Unless your problem is really huge - over
>>>> 1,000,000 atoms say, I can't imagine you can effectively use all 512
>>>> processors. The pmemd code gets good performance via a two-pronged
>>>> approach: 1) first we maximize the single processor performance, and 2)
>>>> then we do whatever we can to parallelize well. Currently, due to
>>>> limitations of slab-based fft workload division, you generally are best
>>>> off somewhere below 512 processors (you will get throughput as good as
>>>> some of the competing systems that scale better, but on fewer
>>>> processors - and ultimately what you should care about is nsec/day
>>>> throughput). Anything strange about the hardware/software you are
>>>> using? Is it something I directly support? Is it an sgi altix (where
>>>> most of the stack problems seem to occur, I would guess due to some
>>>> default stack limits settings)? Bottom line - I need a lot more info
>>>> if you actually want help.
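The remark above about slab-based fft workload division can be made concrete with a generic sketch. In a slab decomposition each processor owns whole planes of the FFT grid along one axis, so no more processors than planes can hold FFT work. The grid dimension of 96 used below is a hypothetical value chosen for illustration, not one taken from this thread, and the functions do not reproduce pmemd's actual distribution logic.

```python
def slab_counts(n_planes, n_procs):
    """Spread n_planes FFT planes over n_procs as evenly as possible."""
    base, extra = divmod(n_planes, n_procs)
    return [base + 1 if rank < extra else base for rank in range(n_procs)]


def fft_balance(n_planes, n_procs):
    """Mean planes per processor divided by the largest share (1.0 is ideal)."""
    counts = slab_counts(n_planes, n_procs)
    return (n_planes / n_procs) / max(counts)


N_PLANES = 96  # hypothetical grid dimension along the slab axis

for procs in (32, 128, 512):
    counts = slab_counts(N_PLANES, procs)
    idle = counts.count(0)
    print(f"{procs:4d} procs: {idle:3d} without FFT planes, "
          f"balance {fft_balance(N_PLANES, procs):.2f}")
```

Under these assumptions, every processor beyond the number of planes gets no reciprocal-space FFT work at all, which is one plausible reading of why adding processors stops paying off well before 512.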
>>>> On sander, the stack problem is not as big a pain because sander does
>>>> not use nearly as much stack-based allocation (I do it in pmemd because
>>>> it gives slightly better performance due to page reuse - it is also a
>>>> very nice programming model). Sander 8, when compiled in default mode,
>>>> only runs on a power of two processor count; there is a #define that
>>>> can override this; the resultant code is probably a bit slower (the
>>>> define is noBTREE). I think sander 9 does not require the define; it
>>>> just uses the power of 2 algorithms if you have a power of 2 cpu count.
>>>> Oh, but you hit the 128 cpu limit - the define to bump that up is
>>>> MPI_MAX_PROCESSORS in parallel.h of sander 8. It is actually a pretty
>>>> bad idea to try to run sander on more than 128 processors though.
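The sander 8 restriction described above (a power of two, no greater than 128 in the default build) can be checked before submitting a job. The sketch below is an illustration only: the constant and function names are invented here, and the code is not taken from the sander source. It simply mirrors the rule stated in the error message quoted in this thread.

```python
SANDER8_MAX_PROCS = 128  # limit quoted in the sander 8 error message


def is_power_of_two(n):
    """True when n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def check_proc_count(n, limit=SANDER8_MAX_PROCS):
    """Return an error string if n would be rejected, otherwise None."""
    if not is_power_of_two(n) or n > limit:
        return ("The number of processors must be a power of 2 "
                f"and no greater than {limit}, but is {n}")
    return None


def largest_allowed(n, limit=SANDER8_MAX_PROCS):
    """Largest power of two <= n, capped at limit.

    Assumes n >= 1 and that limit is itself a power of two.
    """
    return min(limit, 1 << (n.bit_length() - 1))


for procs in (4, 32, 96, 128, 256):
    verdict = check_proc_count(procs) or "ok"
    print(f"{procs:4d}: {verdict} (largest allowed: {largest_allowed(procs)})")
```

A check like this in a job submission script avoids burning queue time on a run that would exit immediately.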
>>>> Two other notes on pmemd:
>>>> 1) to rule out problems with your specific simulation system, try
>>>> running the factor ix benchmark - say for 5000 steps, 128-256 cpu's, on
>>>> your system. If this works, then you know it is something about your
>>>> simulation system; if it doesn't, then it is something about your
>>>> hardware or possibly a compiler bug for the compiler used to build
>>>> pmemd (since factor ix is run all over the world at all sorts of
>>>> processor counts, correctly built pmemd on a good hardware setup is
>>>> known to work).
>>>> 2) to get better debugging info, try running your simulation system on
>>>> a version of pmemd built with:
>>>> F90_OPT_DFLT = $(F90_OPT_DBG) in the config.h. Expect this to be
>>>> really really slow; you just disabled all optimizations. There may be
>>>> other environment variables you need to set to get more debug info,
>>>> depending on your compiler.
>>>> Regards - Bob Duke
>>>>
>>>> ----- Original Message ----- From: "Vlad Cojocaru"
>>>> <Vlad.Cojocaru.eml-r.villa-bosch.de>
>>>> To: "AMBER list" <amber.scripps.edu>
>>>> Sent: Monday, March 26, 2007 5:14 AM
>>>> Subject: AMBER: pmemd segmentation fault
>>>>
>>>>
>>>>> Dear Amber users,
>>>>>
>>>>> I am trying to set up some Amber runs on a large cluster. So, I
>>>>> switched from sander (AMBER 8) to pmemd (AMBER 9) and I ran it on 512
>>>>> processors. The job runs for 400 (out of 1,000,000) steps and then it
>>>>> is interrupted with the error below. In the output I get the
>>>>> following warning: "WARNING: Stack usage limited by a hard resource
>>>>> limit of 4294967295 bytes! If segment violations occur, get your
>>>>> sysadmin to increase the limit.". Could anyone advise me how to deal
>>>>> with this? I should also tell you that the same job runs fine using
>>>>> sander (AMBER 8) on 32 processors or 4 CPUs.
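The stack limits behind a warning like the one quoted above can be inspected on a compute node with a short script. This is a minimal sketch, assuming a Unix-like system where Python's standard `resource` module is available; it does not reproduce how pmemd itself queries the limit, and `describe_limit` is a helper name invented for the example.

```python
import resource


def describe_limit(value):
    """Render an rlimit value as text; RLIM_INFINITY means no limit."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} bytes ({value / 2**20:.1f} MiB)"


# The soft limit is the one actually enforced; the hard limit is the
# ceiling up to which an unprivileged process may raise the soft limit.
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("soft stack limit:", describe_limit(soft))
print("hard stack limit:", describe_limit(hard))

# The hard limit quoted in the warning is the largest unsigned 32 bit value.
print("4294967295 == 2**32 - 1:", 4294967295 == 2**32 - 1)
```

Running this inside the batch job, rather than on a login node, shows the limits that the MPI processes actually inherit, which may differ from those of an interactive shell.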
>>>>>
>>>>> And a second question ... when I tried sander (AMBER 8) on 256 CPUs,
>>>>> the job exited with the error "The number of processors must be a power
>>>>> of 2 and no greater than 128 , but is 256". Is 128 CPUs the upper
>>>>> limit for sander in AMBER 8? Does sander in AMBER 9 have the same
>>>>> limit?
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>> Best wishes
>>>>> Vlad
>>>>>
>>>>>
>>>>>
>>>>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>>>>> Image    PC                Routine  Line     Source
>>>>> pmemd    4000000000067010  Unknown  Unknown  Unknown
>>>>> pmemd    400000000002D8C0  Unknown  Unknown  Unknown
>>>>> pmemd    4000000000052F10  Unknown  Unknown  Unknown
>>>>> pmemd    40000000000775B0  Unknown  Unknown  Unknown
>>>>> pmemd    40000000000B8730  Unknown  Unknown  Unknown
>>>>> pmemd    40000000000049D0  Unknown  Unknown  Unknown
>>>>> Unknown  20000000005913F0  Unknown  Unknown  Unknown
>>>>> pmemd    4000000000004400  Unknown  Unknown  Unknown
>>>>>
>>>>> Stack trace terminated abnormally.
>>>>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>>>>> Image    PC                Routine  Line     Source
>>>>> pmemd    40000000000625A0  Unknown  Unknown  Unknown
>>>>> pmemd    400000000002DA60  Unknown  Unknown  Unknown
>>>>> pmemd    4000000000052F10  Unknown  Unknown  Unknown
>>>>> pmemd    40000000000775B0  Unknown  Unknown  Unknown
>>>>> pmemd    40000000000B8730  Unknown  Unknown  Unknown
>>>>> pmemd    40000000000049D0  Unknown  Unknown  Unknown
>>>>> Unknown  20000000005913F0  Unknown  Unknown  Unknown
>>>>> pmemd    4000000000004400  Unknown  Unknown  Unknown
>>>>>
>>>>> Stack trace terminated abnormally.
>>>>>
>>>>> --
>>>>> ----------------------------------------------------------------------------
>>>>>
>>>>>
>>>>> Dr. Vlad Cojocaru
>>>>>
>>>>> EML Research gGmbH
>>>>> Schloss-Wolfsbrunnenweg 33
>>>>> 69118 Heidelberg
>>>>>
>>>>> Tel: ++49-6221-533266
>>>>> Fax: ++49-6221-533298
>>>>>
>>>>> e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de
>>>>>
>>>>> http://projects.villa-bosch.de/mcm/people/cojocaru/
>>>>>
>>>>> ----------------------------------------------------------------------------
>>>>>
>>>>>
>>>>> EML Research gGmbH
>>>>> Amtgericht Mannheim / HRB 337446
>>>>> Managing Partner: Dr. h.c. Klaus Tschira
>>>>> Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter
>>>>> http://www.eml-r.org
>>>>> ----------------------------------------------------------------------------
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -----------------------------------------------------------------------
>>>>>
>>>>> The AMBER Mail Reflector
>>>>> To post, send mail to amber.scripps.edu
>>>>> To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
>>>>>

-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
Received on Wed Mar 28 2007 - 06:07:28 PDT