Re: [AMBER] MMPBSA errors: "failed with prmtop" and "failed when querying netcdf trajectory"

From: Vlad Cojocaru <vlad.cojocaru.mpi-muenster.mpg.de>
Date: Tue, 12 Nov 2013 15:22:50 +0100

Ok .. I think I got it now ...

LPB needs about 1.3 GB per frame at a grid spacing of 0.5 and about 8 GB
per frame at a spacing of 0.25.
NLPB seems to have similar memory requirements but is of course more
CPU-expensive.
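
In case anyone wants to reproduce this kind of single-frame measurement, a
minimal sketch along these lines should work on Linux (the file names and
the one-frame PB input are placeholders, not my actual setup):

# measure the peak memory of a single-frame MMPBSA.py PB run (Linux)
import resource, subprocess

# placeholder file names; substitute your own prmtops, PB input and
# a trajectory containing a single frame
cmd = ["MMPBSA.py", "-O", "-i", "pb_1frame.in",
       "-cp", "complex.top", "-rp", "receptor.top", "-lp", "ligand.top",
       "-y", "one_frame.cdf"]
subprocess.call(cmd)

# ru_maxrss is reported in kilobytes on Linux; RUSAGE_CHILDREN should give
# the largest resident set reached by the finished child processes
# (MMPBSA.py and the PB run it launches)
peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print("peak RSS: %.2f GB" % (peak_kb / 1024.0 / 1024.0))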

Bottom line: I need to ensure that I have about 8 GB available per running
core at all times to be able to use the 0.25 grid spacing ...
And on the cluster I only have 4 GB per core.
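
As a rough sketch of that arithmetic (assuming a hypothetical node layout of
16 cores with 4 GB per core, and that each MPI rank holds one PB grid, i.e.
one frame, in memory at a time):

# back-of-the-envelope check: how many MMPBSA.py.MPI ranks fit on one node
mem_per_frame_gb = {0.50: 1.3, 0.25: 8.0}   # measured LPB memory per frame (see above)
cores_per_node = 16                          # assumed node layout: 16 cores x 4 GB/core
node_mem_gb = cores_per_node * 4.0

for spacing in sorted(mem_per_frame_gb, reverse=True):
    mem = mem_per_frame_gb[spacing]
    # per-node memory grows with the number of ranks placed on the node,
    # because every rank works on its own frame (its own PB grid) at the same time
    ranks_that_fit = min(cores_per_node, int(node_mem_gb // mem))
    print("spacing %.2f: %.1f GB/rank -> at most %d ranks per %d-core node"
          % (spacing, mem, ranks_that_fit, cores_per_node))

So at 0.25 spacing I would have to leave roughly half of each node's cores
idle (or request more nodes and spread the ranks out), which matches what
Jason suggests below.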

Decreasing the grid spacing seems to make quite a significant difference
with NLPB but not with LPB.

I hope now I got it right ...

Best wishes
Vlad


On 11/12/2013 02:36 PM, Vlad Cojocaru wrote:
> Hi Jason,
>
> I do not have so much control over the cluster, as it's not a local cluster ...
>
> I was indeed using the 256 cores as requested (at least to my knowledge I
> cannot do it differently on this machine) ... Well, it seems that I
> don't fully understand how MMPBSA deals with memory ... I was
> thinking that the memory usage per job should not change with the
> number of cores, since the number of frames analyzed per core
> decreases as the number of cores increases ...
>
> Obviously, my thinking is flawed, as from what you are saying the memory
> requirements increase with the number of cores ...
>
> So, if I get the memory usage for a single frame on a single core, can I
> actually calculate how much memory I need for, let's say, 10,000 frames on
> 128 cores?
>
> I will do some single-core, single-frame tests now ...
>
> Best wishes
> Vlad
>
> On 11/12/2013 02:17 PM, Jason Swails wrote:
>> On Tue, 2013-11-12 at 11:53 +0100, Vlad Cojocaru wrote:
>>> Dear all,
>>>
>>> I am experiencing some rather strange errors while running MMPBSA
>>> (AmberTools 13) on a cluster (see Errors 1 & 2 and Outputs 1 & 2 below).
>>> These errors do not make any sense, since the top files and the
>>> trajectories are there and are correct. Besides, exactly the same jobs
>>> run properly sometimes (not many times, though). The cluster's support
>>> team told me that my jobs were using an
>>> incredible amount of memory (623 GB when running on 128 cores) ... However,
>>> when I increased the number of cores to 256 to account for the maximum
>>> memory available (4 GB/core), the same errors popped up ...
>> Did you continue to _use_ all 256 cores that you requested, or did you
>> use 128 cores while requesting the resources for 256? The stdout stream
>> should indicate how many cores MMPBSA.py.MPI was attempting to run on.
>> If 128 cores request 623 GB of RAM, you can expect 256 cores to request
>> a bit over 1 TB.
>>
>> What you would need to do is make sure you request enough resources to
>> get the necessary RAM, but then only run on as many cores as that RAM
>> will support. If you are using a distributed memory cluster, you need
>> to make sure that the threads run in the correct 'place'. Usually this
>> means starting the same number of threads on each node that you've
>> requested.
>>
>>> These errors initially appeared for the non-linear PB calculation with a
>>> grid spacing of 0.25 but the same errors are reproducible with linear PB
>>> and the default spacing of 0.5 ... which makes me skeptical about the
>>> memory issue ...
>> NLPB and LPB use the same grid, to my knowledge. NLPB evaluates the
>> full sinh expression for the ionic strength contribution (as opposed to
>> the first term of its Taylor series expansion), which is where the added
>> cost comes from. Doubling the grid spacing reduces the grid's footprint
>> in memory to roughly one eighth. My suggestion, if you're having this much
>> trouble with memory, is to analyze one frame on an interactive node (or your
>> personal machine) and monitor the size of the grid and the total used memory.
>>
>>> I should also add that while Error 1 occurs at the beginning of the run,
>>> Error 2 occurs at some point while the job appears to be running correctly ...
>>> I also set the debug printlevel to 1, but the errors (given below) are not
>>> comprehensible ...
>>>
>>> Amber 12 + AmberTools 13 updated as of yesterday were compiled with
>>> Intel 13.0 and Intel MPI 4.1.0
>>>
>>> Has anybody seen anything like this before?
>>>
>>> Best wishes
>>> Vlad
>>>
>>> ******** Error 1**************
>>> TrajError:
>>> /usr/users/vcojoca/apps/cluster_intel/amber/12_tools-13_intel-13.0_impi-4.1.0/bin/cpptraj
>>> failed when querying complex.cdf
>>> Error occured on rank 0.
>>> Exiting. All files have been retained.
>> This error indicates cpptraj was unable to read complex.cdf, or that for
>> some reason the input or output of cpptraj got mangled. This step is
>> only performed by a single thread and all communication is done through
>> pipes, so it's hard to debug without seeing exactly what happened.
>>
>> I've attached a patch that should print out more information when this
>> error occurs. If you apply it via:
>>
>> ./update_amber --apply path/to/mmpbsa_errormsg.patch
>>
>> and then recompile (you can do this with
>>
>> make -C $AMBERHOME/AmberTools/src/mmpbsa_py install
>>
>> to avoid recompiling everything), it should print out more helpful
>> information next time. I'll need the output from that to figure out
>> what's really happening here.
>>
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>>> [8:gwdn165] unexpected disconnect completion event from [0:gwdn028]
>>> Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
>>>
>>> ******** Output 1**************
>>> Loading and checking parameter files for compatibility...
>>> sander found! Using
>>> /usr/users/vcojoca/apps/cluster_intel/amber/12_tools-13_intel-13.0_impi-4.1.0/bin/sander
>>> cpptraj found! Using
>>> /usr/users/vcojoca/apps/cluster_intel/amber/12_tools-13_intel-13.0_impi-4.1.0/bin/cpptraj
>>> Preparing trajectories for simulation...
>>> rank 16 in job 1 gwdn028_38960 caused collective abort of all ranks
>>> exit status of rank 16: killed by signal 9
>>> rank 0 in job 1 gwdn028_38960 caused collective abort of all ranks
>>> exit status of rank 0: killed by signal 9
>>>
>>> ******** Error 2**************
>>> CalcError:
>>> /usr/users/vcojoca/apps/cluster_intel/amber/12_tools-13_intel-13.0_impi-4.1.0/bin/sander
>>> failed with prmtop complex.top!
>>> Error occured on rank 93.
>>> Exiting. All files have been retained.
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 93
>> Is there anything printed in _MMPBSA_complex_pb.mdout.93? There's no
>> error message here. It might be memory overflow or might not. I can't
>> be sure.
>>
>> All the best,
>> Jason

-- 
Dr. Vlad Cojocaru
Max Planck Institute for Molecular Biomedicine
Department of Cell and Developmental Biology
Röntgenstrasse 20, 48149 Münster, Germany
Tel: +49-251-70365-324; Fax: +49-251-70365-399
Email: vlad.cojocaru[at]mpi-muenster.mpg.de
http://www.mpi-muenster.mpg.de/research/teams/groups/rgcojocaru
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Nov 12 2013 - 06:30:03 PST