Re: [AMBER] MMPBSA errors: "failed with prmtop" and "failed when querying netcdf trajectory"

From: Vlad Cojocaru <vlad.cojocaru.mpi-muenster.mpg.de>
Date: Tue, 12 Nov 2013 14:36:23 +0100

Hi Jason,

I do not have much control over the cluster, as it's not a local one ...

I was indeed using the 256 cores as requested (at least to my knowledge,
it cannot be done differently on this machine) .... Well, it seems that I
don't fully understand how MMPBSA deals with memory ... I was thinking
that the memory usage per job should not change with the number of cores,
since the number of frames analyzed per core decreases as the number of
cores increases ...

Obviously, my thinking was flawed, since from what you are saying the
memory requirements increase with the number of cores ...

So, if I measure the memory usage for a single frame on a single core, can
I actually calculate how much memory I need for, let's say, 10000 frames
on 128 cores?
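
In other words, something like this back-of-the-envelope estimate, where
the per-core figure is a placeholder I would measure first, and the frame
count drops out because each thread holds only one frame at a time?

   # hypothetical per-core peak memory, e.g. ~623 GB / 128 cores:
   PER_CORE_MB=4980
   NCORES=128
   echo "approx. total: $(( PER_CORE_MB * NCORES / 1024 )) GB"   # ~622 GB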

I will do some single-core, single-frame tests now ...

Best wishes
Vlad

On 11/12/2013 02:17 PM, Jason Swails wrote:
> On Tue, 2013-11-12 at 11:53 +0100, Vlad Cojocaru wrote:
>> Dear all,
>>
>> I am experiencing quite a few strange errors while running MMPBSA
>> (AmberTools 13) on a cluster (see below Errors 1&2 and Outputs 1&2).
>> These errors do not make any sense, since the top files and the
>> trajectories are there and are correct. Besides, exactly the same jobs
>> sometimes run properly (though not often). Speaking with the support
>> team of the cluster, they told me that my jobs were using an incredible
>> amount of memory (623 GB when running on 128 cores) ... However, when I
>> increased the number of cores to 256 to account for the maximum memory
>> available (4 GB/core), the same errors popped up ...
> Did you continue to _use_ all 256 cores that you requested, or did you
> use 128 cores while requesting the resources for 256? The stdout stream
> should indicate how many cores MMPBSA.py.MPI was attempting to run on.
> If 128 cores request 623 GB of RAM, you can expect 256 cores to request
> a bit over 1 TB.
>
> What you need to do is make sure you request enough resources to get the
> necessary RAM, but then only run on as many cores as that RAM will
> support. If you are using a distributed-memory cluster, you also need to
> make sure that the threads run in the correct 'place'. Usually this means
> starting the same number of threads on each node that you've requested;
> see the sketch below.
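>
> For example (a sketch only; it assumes Intel MPI's mpirun, a hypothetical
> allocation of 16 nodes with 16 cores each, and placeholder input names):
>
>   # 256 cores allocated, but only 128 MPI threads started, spread
>   # evenly (8 per node) so each thread effectively gets 8 GB, not 4 GB:
>   mpirun -n 128 -ppn 8 MMPBSA.py.MPI -O -i mmpbsa.in \
>       -cp complex.top -y complex.cdf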
>
>> These errors initially appeared for the non-linear PB calculation with a
>> grid spacing of 0.25, but the same errors are reproducible with linear
>> PB and the default spacing of 0.5 ... which makes me skeptical about the
>> memory explanation ...
> NLPB and LPB use the same grid, to my knowledge. NLPB evaluates the full
> tanh expression for the ionic strength contribution (as opposed to the
> first term of its Taylor series expansion), which is where the added cost
> comes from. Doubling the grid spacing shrinks the grid's memory footprint
> to 1/8 of its former size. My suggestion, if you're having this much
> trouble with memory, is to analyze one frame on an interactive node (or
> your personal machine) and monitor the size of the grid and the total
> memory used.
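>
> Something like this would do (a sketch; the input names are placeholders):
>
>   # run a single-frame calculation in one shell ...
>   MMPBSA.py -O -i mmpbsa.in -cp complex.top -y single_frame.cdf
>   # ... and watch the resident size (RSS, kB) of the workers in another:
>   watch -n 5 'ps -C sander,cpptraj -o rss,comm'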
>
>> I should also add that while Error 1 occurs at the beginning of the run,
>> Error 2 occurs at some point while the job appears to be running
>> correctly ... I also set the debug printlevel to 1, but the errors
>> (given below) are not comprehensible ...
>>
>> Amber 12 + AmberTools 13 updated as of yesterday were compiled with
>> Intel 13.0 and Intel MPI 4.1.0
>>
>> Has anybody seen anything like this before?
>>
>> Best wishes
>> Vlad
>>
>> ******** Error 1**************
>> TrajError:
>> /usr/users/vcojoca/apps/cluster_intel/amber/12_tools-13_intel-13.0_impi-4.1.0/bin/cpptraj
>> failed when querying complex.cdf
>> Error occured on rank 0.
>> Exiting. All files have been retained.
> This error indicates cpptraj was unable to read complex.cdf, or that for
> some reason the input or output of cpptraj got mangled. This step is
> only performed by a single thread and all communication is done through
> pipes, so it's hard to debug without seeing exactly what happened.
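>
> In the meantime, you can run the query step by hand to see cpptraj's own
> error message (a sketch; the file names are taken from your output):
>
>   cpptraj complex.top <<EOF
>   trajin complex.cdf
>   EOF
>
> If that complains, the trajectory (or its netCDF header) is the problem
> rather than MMPBSA.py itself.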
>
> I've attached a patch that should print out more information when this
> error occurs. If you apply it via:
>
> ./update_amber --apply path/to/mmpbsa_errormsg.patch
>
> and then recompile (you can do this with
>
> make -C $AMBERHOME/AmberTools/src/mmpbsa_py install
>
> to avoid recompiling everything), it should print out more helpful
> information next time. I'll need the output from that to figure out
> what's really happening here.
>
>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>> [8:gwdn165] unexpected disconnect completion event from [0:gwdn028]
>> Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
>>
>> ******** Output 1**************
>> Loading and checking parameter files for compatibility...
>> sander found! Using
>> /usr/users/vcojoca/apps/cluster_intel/amber/12_tools-13_intel-13.0_impi-4.1.0/bin/sander
>> cpptraj found! Using
>> /usr/users/vcojoca/apps/cluster_intel/amber/12_tools-13_intel-13.0_impi-4.1.0/bin/cpptraj
>> Preparing trajectories for simulation...
>> rank 16 in job 1 gwdn028_38960 caused collective abort of all ranks
>> exit status of rank 16: killed by signal 9
>> rank 0 in job 1 gwdn028_38960 caused collective abort of all ranks
>> exit status of rank 0: killed by signal 9
>>
>> ******** Error 2**************
>> CalcError:
>> /usr/users/vcojoca/apps/cluster_intel/amber/12_tools-13_intel-13.0_impi-4.1.0/bin/sander
>> failed with prmtop complex.top!
>> Error occured on rank 93.
>> Exiting. All files have been retained.
>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 93
> Is there anything printed in _MMPBSA_complex_pb.mdout.93? There's no
> error message here. It might be memory overflow or might not. I can't
> be sure.
>
> All the best,
> Jason

-- 
Dr. Vlad Cojocaru
Max Planck Institute for Molecular Biomedicine
Department of Cell and Developmental Biology
Röntgenstrasse 20, 48149 Münster, Germany
Tel: +49-251-70365-324; Fax: +49-251-70365-399
Email: vlad.cojocaru[at]mpi-muenster.mpg.de
http://www.mpi-muenster.mpg.de/research/teams/groups/rgcojocaru
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Nov 12 2013 - 06:00:02 PST