Dear all,
I am getting some strange errors while running MMPBSA
(AmberTools 13) on a cluster (see Errors 1 & 2 and Outputs 1 & 2 below).
These errors do not make sense to me, since the topology files and the
trajectories are present and correct. Besides, exactly the same jobs
sometimes run properly (though not often). The cluster support team
told me that my jobs were using an enormous amount of memory (623 GB
when running on 128 cores). However, when I increased the number of
cores to 256 to get more total memory (the limit is 4 GB/core), the
same errors popped up.
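
A rough check of the figures above, as a minimal Python sketch. It assumes the 623 GB was spread evenly over the 128 ranks and that the per-rank footprint stays roughly constant when more ranks are added; neither assumption is confirmed by the support team's report.

```python
# Figures quoted in the message; even distribution across ranks is an assumption.
total_mem_gb = 623.0
cores = 128
limit_gb_per_core = 4.0

# Average memory per core for the 128-core run.
per_core_gb = total_mem_gb / cores
print(f"{per_core_gb:.2f} GB per core (limit {limit_gb_per_core} GB)")

# Assumption: each rank needs the same memory regardless of the rank count.
# Then doubling the ranks doubles both the demand and the available memory.
new_cores = 256
projected_total_gb = per_core_gb * new_cores
available_gb = limit_gb_per_core * new_cores
print(f"projected {projected_total_gb:.0f} GB vs available {available_gb:.0f} GB")
```

Under these assumptions the per-core demand is unchanged by going to 256 cores, so adding cores alone would not bring a job under a per-core limit.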
These errors initially appeared for the non-linear PB calculation with a
grid spacing of 0.25, but they are also reproducible with linear PB and
the default spacing of 0.5, which makes me skeptical that memory is the
issue.
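
For reference, the effect of the grid spacing on the number of grid points can be sketched as below. This is only the geometric scaling of a uniform 3D grid over a fixed box; the function name is illustrative, and the actual memory use of the PB solver depends on more than the point count.

```python
def grid_point_ratio(coarse_spacing, fine_spacing, dims=3):
    """Ratio of grid points between two uniform grids covering the same box."""
    return (coarse_spacing / fine_spacing) ** dims

# Going from the default spacing of 0.5 to 0.25 in three dimensions.
ratio = grid_point_ratio(0.5, 0.25)
print(f"{ratio:.0f}x more grid points at spacing 0.25 than at 0.5")
```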
I should also add that Error 1 occurs at the beginning of the run,
whereas Error 2 occurs some time into a job that otherwise appears to be
running correctly. I also set the debug printlevel to 1, but the
resulting errors (given below) are not informative.
Amber 12 + AmberTools 13, updated as of yesterday, were compiled with
Intel 13.0 and Intel MPI 4.1.0.
Has anybody seen anything like this before?
Best wishes
Vlad
******** Error 1**************
TrajError:
/usr/users/vcojoca/apps/cluster_intel/amber/12_tools-13_intel-13.0_impi-4.1.0/bin/cpptraj
failed when querying complex.cdf
Error occured on rank 0.
Exiting. All files have been retained.
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
[8:gwdn165] unexpected disconnect completion event from [0:gwdn028]
Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
******** Output 1**************
Loading and checking parameter files for compatibility...
sander found! Using
/usr/users/vcojoca/apps/cluster_intel/amber/12_tools-13_intel-13.0_impi-4.1.0/bin/sander
cpptraj found! Using
/usr/users/vcojoca/apps/cluster_intel/amber/12_tools-13_intel-13.0_impi-4.1.0/bin/cpptraj
Preparing trajectories for simulation...
rank 16 in job 1 gwdn028_38960 caused collective abort of all ranks
exit status of rank 16: killed by signal 9
rank 0 in job 1 gwdn028_38960 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
******** Error 2**************
CalcError:
/usr/users/vcojoca/apps/cluster_intel/amber/12_tools-13_intel-13.0_impi-4.1.0/bin/sander
failed with prmtop complex.top!
Error occured on rank 93.
Exiting. All files have been retained.
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 93
******** Output 2**************
Loading and checking parameter files for compatibility...
sander found! Using
/usr/users/vcojoca/apps/cluster_intel/amber/12_tools-13_intel-13.0_impi-4.1.0/bin/sander
cpptraj found! Using
/usr/users/vcojoca/apps/cluster_intel/amber/12_tools-13_intel-13.0_impi-4.1.0/bin/cpptraj
Preparing trajectories for simulation...
400 frames were processed by cpptraj for use in calculation.
Running calculations on normal system...
Beginning PB calculations with
/usr/users/vcojoca/apps/cluster_intel/amber/12_tools-13_intel-13.0_impi-4.1.0/bin/sander
calculating complex contribution...
rank 93 in job 1 gwdc125_46727 caused collective abort of all ranks
exit status of rank 93: killed by signal 9
rank 92 in job 1 gwdc125_46727 caused collective abort of all ranks
exit status of rank 92: killed by signal 9
rank 88 in job 1 gwdc125_46727 caused collective abort of all ranks
exit status of rank 88: killed by signal 9
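
On Linux, signal 9 is SIGKILL, meaning the processes were terminated from outside rather than exiting on their own. A kernel out-of-memory kill or a batch-system memory limit are common causes, but the logs above do not say which applies. A minimal Python sketch for listing the killed ranks from such output follows; the log excerpt is copied from Output 2, and the helper name is illustrative.

```python
import re
import signal

# Excerpt copied from Output 2 above.
LOG = """\
rank 93 in job 1 gwdc125_46727 caused collective abort of all ranks
exit status of rank 93: killed by signal 9
rank 92 in job 1 gwdc125_46727 caused collective abort of all ranks
exit status of rank 92: killed by signal 9
rank 88 in job 1 gwdc125_46727 caused collective abort of all ranks
exit status of rank 88: killed by signal 9
"""

PATTERN = re.compile(r"exit status of rank (\d+): killed by signal (\d+)")

def killed_ranks(text):
    """Return (rank, signal name) pairs for every 'killed by signal' line."""
    return [(int(rank), signal.Signals(int(sig)).name)
            for rank, sig in PATTERN.findall(text)]

result = killed_ranks(LOG)
for rank, name in result:
    print(f"rank {rank}: {name}")
```

Mapping the reported ranks to their nodes and checking the system logs there for out-of-memory messages would be one way to test the memory hypothesis.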
--
Dr. Vlad Cojocaru
Max Planck Institute for Molecular Biomedicine
Department of Cell and Developmental Biology
Röntgenstrasse 20, 48149 Münster, Germany
Tel: +49-251-70365-324; Fax: +49-251-70365-399
Email: vlad.cojocaru[at]mpi-muenster.mpg.de
http://www.mpi-muenster.mpg.de/research/teams/groups/rgcojocaru
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Nov 12 2013 - 03:00:02 PST