Re: [AMBER] MMPBSA errors: "failed with prmtop" and "failed when querying netcdf trajectory"

From: Jason Swails <>
Date: Tue, 12 Nov 2013 08:17:09 -0500

On Tue, 2013-11-12 at 11:53 +0100, Vlad Cojocaru wrote:
> Dear all,
> I am experiencing quite some strange errors while running MMPBSA
> (AmberTools 13) on a cluster (see below Errors 1&2 and Outputs 1&2) .
> These errors do not make any sense since the top files and the
> trajectories are there and are correct. Besides, exactly the same jobs
> run properly sometimes (not many times though). Speaking with the
> support team from the cluster, they told me that my jobs were using an
> incredible amount of memory (623 GB when running on 128 cores). However,
> when I increased the number of cores to 256 to account for the maximum
> memory available (4 GB/core), the same errors popped up...

Did you continue to _use_ all 256 cores that you requested, or did you
use 128 cores while requesting the resources for 256? The stdout stream
should indicate how many cores MMPBSA was attempting to run on.
If 128 cores request 623 GB of RAM, you can expect 256 cores to request
a bit over 1 TB.

What you would need to do is make sure you request enough resources to
get the necessary RAM, but then only run on as many cores as that RAM
will support. If you are using a distributed-memory cluster, you also
need to make sure that the threads run in the correct 'place'. Usually
this means starting the same number of threads on each node that you've
requested, so the memory load is spread evenly across the nodes.
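To make the arithmetic concrete, here is a back-of-the-envelope check using the numbers reported above (623 GB over 128 threads, 4 GB of RAM per core); the figures are only estimates:

```python
# Rough per-thread memory estimate from the numbers in this thread:
# 623 GB across 128 MMPBSA threads, on nodes offering 4 GB per core.
total_gb = 623.0
threads = 128
ram_per_core = 4.0

per_thread = total_gb / threads               # ~4.87 GB needed per thread
cores_to_reserve = per_thread / ram_per_core  # ~1.22 cores to reserve per thread

# Doubling the thread count doubles the total request:
print(per_thread * 256)  # ~1246 GB, i.e. "a bit over 1 TB"
```

In other words, on 4 GB/core hardware each thread effectively needs the RAM of ~1.2 cores, so requesting 256 cores but launching only ~128 threads is the right shape of solution.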

> These errors initially appeared for the non-linear PB calculation with a
> grid spacing of 0.25 but the same errors are reproducible with linear PB
> and the default spacing of 0.5 ... which makes me skeptical about the
> memory issue ...

NLPB and LPB use the same grid, to my knowledge. NLPB evaluates the
full tanh expression for the ionic-strength contribution (as opposed to
just the first term of its Taylor series expansion), which is where the
added cost comes from. Doubling the grid spacing cuts the grid's memory
footprint to roughly 1/8 of its former size. If memory is giving you
this much trouble, my suggestion is to analyze a single frame on an
interactive node (or your personal machine) and monitor the size of the
grid and the total memory used.
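The cubic scaling is easy to verify with a quick estimate (this is an illustration of the node count only, not PBSA's actual allocator; the 100 Å box edge is an arbitrary example):

```python
# Illustration of cubic PB-grid scaling (an estimate, not PBSA's allocator):
# a cubic box of edge L with spacing h holds about (L/h + 1)**3 grid nodes.
def grid_nodes(edge, spacing):
    n = int(edge / spacing) + 1
    return n ** 3

fine = grid_nodes(100.0, 0.25)   # 401**3 nodes at 0.25 A spacing
coarse = grid_nodes(100.0, 0.5)  # 201**3 nodes at the default 0.5 A
print(fine / coarse)             # ~7.9: doubling h gives ~8x fewer nodes
```

Since per-node storage is fixed, memory use tracks the node count, which is why the 0.5 Å default grid should be dramatically cheaper than 0.25 Å.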

> I should also add that while Error 1 occurs at the beginning of the run,
> Error 2 occurs sometime while the job appears to run correctly ... I
> also set the debug printlevel to 1 but the errors (given below) are not
> comprehensible ....
> Amber 12 + AmberTools 13 updated as of yesterday were compiled with
> Intel 13.0 and Intel MPI 4.1.0
> Has anybody seen anything like this before?
> Best wishes
> Vlad
> ******** Error 1**************
> TrajError:
> /usr/users/vcojoca/apps/cluster_intel/amber/12_tools-13_intel-13.0_impi-4.1.0/bin/cpptraj
> failed when querying complex.cdf
> Error occured on rank 0.
> Exiting. All files have been retained.

This error indicates cpptraj was unable to read complex.cdf, or that for
some reason the input or output of cpptraj got mangled. This step is
only performed by a single thread and all communication is done through
pipes, so it's hard to debug without seeing exactly what happened.
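For reference, the kind of pipe-based invocation involved looks roughly like this (a sketch, not MMPBSA's actual code; `query_traj` and the exact cpptraj command string are illustrative):

```python
import subprocess

def query_traj(cpptraj, prmtop, traj):
    """Feed cpptraj commands over stdin and capture stdout/stderr."""
    cmds = f"parm {prmtop}\ntrajin {traj}\nlist\n"
    proc = subprocess.run([cpptraj], input=cmds,
                          capture_output=True, text=True)
    if proc.returncode != 0:
        # All the parent sees is a non-zero exit status and whatever
        # made it onto stderr before the child died.
        raise RuntimeError(f"{cpptraj} failed when querying {traj}:\n"
                           + proc.stderr)
    return proc.stdout
```

If the child process is killed externally (e.g. by the OOM killer, as the "killed by signal 9" lines in Output 1 suggest), a bad return code is essentially all the parent can report, which is why the original error message is so terse.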

I've attached a patch that should print out more information when this
error occurs. If you apply it via:

./update_amber --apply path/to/mmpbsa_errormsg.patch

and then recompile (you can do this with

make -C $AMBERHOME/AmberTools/src/mmpbsa_py install

to avoid recompiling everything), it should print out more helpful
information next time. I'll need the output from that to figure out
what's really happening here.

> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
> [8:gwdn165] unexpected disconnect completion event from [0:gwdn028]
> Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
> ******** Output 1**************
> Loading and checking parameter files for compatibility...
> sander found! Using
> /usr/users/vcojoca/apps/cluster_intel/amber/12_tools-13_intel-13.0_impi-4.1.0/bin/sander
> cpptraj found! Using
> /usr/users/vcojoca/apps/cluster_intel/amber/12_tools-13_intel-13.0_impi-4.1.0/bin/cpptraj
> Preparing trajectories for simulation...
> rank 16 in job 1 gwdn028_38960 caused collective abort of all ranks
> exit status of rank 16: killed by signal 9
> rank 0 in job 1 gwdn028_38960 caused collective abort of all ranks
> exit status of rank 0: killed by signal 9
> ******** Error 2**************
> CalcError:
> /usr/users/vcojoca/apps/cluster_intel/amber/12_tools-13_intel-13.0_impi-4.1.0/bin/sander
> failed with prmtop!
> Error occured on rank 93.
> Exiting. All files have been retained.
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 93

Is there anything printed in _MMPBSA_complex_pb.mdout.93? There's no
error message here, so it might be a memory overflow or it might not; I
can't tell without that output.
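A small helper along these lines can triage the per-rank output files in bulk (hypothetical code; the filename pattern is the one MMPBSA leaves behind when files are retained):

```python
import glob

def suspect_mdouts(pattern="_MMPBSA_complex_pb.mdout.*"):
    """Return per-rank sander output files that ended early or mention an error."""
    suspects = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as fh:
            lines = fh.readlines()
        tail = "".join(lines[-5:]).lower()
        # Very short files usually mean the rank died before writing output.
        if "error" in tail or len(lines) < 10:
            suspects.append(path)
    return suspects
```

Running this in the job directory after a failed run narrows the search to the ranks worth inspecting by hand.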

All the best,

Jason M. Swails
Rutgers University
Postdoctoral Researcher

AMBER mailing list

Received on Tue Nov 12 2013 - 05:30:02 PST