Re: [AMBER] MMPBSA errors: "failed with prmtop" and "failed when querying netcdf trajectory"

From: Vlad Cojocaru <vlad.cojocaru.mpi-muenster.mpg.de>
Date: Thu, 14 Nov 2013 15:35:48 +0100

Hi Jason,

I applied the patch you sent me for deciphering the "failed when
querying netcdf trajectory" error. However, the error file doesn't say
anything ...
This error is really weird ... It makes no sense that cpptraj cannot
read the NetCDF file ... The file is there, and a serial MMPBSA analysis
on its first frame always works. Besides, the parallel job even runs
properly sometimes (unfortunately only rarely, so the crashes are
disruptive) ...

Best wishes
Vlad

--------- new error message after patching -----------------------

OUTPUT:


ERROR:
None

TrajError:
/usr/users/vcojoca/apps/cluster_intel/amber/12_tools-13_intel-13.0_impi-4.1.0_patched/bin/cpptraj
failed when querying oct4_sox2-k57e_cano.cdf
Error occured on rank 0.
Exiting. All files have been retained.
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
[8:gwdn064] unexpected disconnect completion event from [0:gwdn141]
Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
internal ABORT - process 8
[16:gwdn134] unexpected disconnect completion event from [0:gwdn141]
Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
internal ABORT - process 16
[32:gwdn158] unexpected disconnect completion event from [0:gwdn141]
Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
internal ABORT - process 32
[64:gwdn072] unexpected disconnect completion event from [0:gwdn141]
Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
internal ABORT - process 64


--------------------- output -----------------
Preparing trajectories for simulation...
rank 64 in job 1 gwdn141_57524 caused collective abort of all ranks
   exit status of rank 64: killed by signal 9
rank 32 in job 1 gwdn141_57524 caused collective abort of all ranks
   exit status of rank 32: killed by signal 9
rank 0 in job 1 gwdn141_57524 caused collective abort of all ranks
   exit status of rank 0: killed by signal 9


On 11/12/2013 04:41 PM, Vlad Cojocaru wrote:
> Ok .. Thanks a lot ..
> I got it right in the end ...
>
> Maybe one more question .. There is a huge difference between the VIRT
> and RES memory usage reported by "top" while running a single-frame MMPBSA
> analysis ..
>
> I get something like 7500 MB VIRT and 3700 MB RES ...
>
> It's the RES value that I should count, isn't it?
>
> Thanks again
> Vlad
>
> On 11/12/2013 04:22 PM, Jason Swails wrote:
>> On Tue, 2013-11-12 at 14:36 +0100, Vlad Cojocaru wrote:
>>> Hi Jason,
>>>
>>> I do not have so much control over the cluster, as it's not a local cluster ..
>>>
>>> I was indeed using the 256 cores as requested (at least to my knowledge, I
>>> cannot do it differently on this machine) .... Well, it seems that I
>>> don't fully understand how MMPBSA deals with memory ... I was
>>> thinking that the memory usage per job should not change with
>>> the number of cores, since the number of frames analyzed per core
>>> decreases as the number of cores increases ...
>> MMPBSA.py analyzes frames sequentially. If you are running in serial,
>> there is never more than 1 frame being analyzed at a time (and therefore
>> only one frame in memory). So regardless of how many frames are being
>> analyzed, the memory consumption will not change.
>>
>> In parallel with N threads, MMPBSA.py.MPI splits up the whole trajectory
>> into N equal-sized (or as close as possible) smaller trajectories which
>> are each then analyzed sequentially. As a result, with N threads you
>> are analyzing N frames at a time, and therefore using N times the memory
>> used in serial.
>>
>> The alternative, which would give far poorer scaling, would be to
>> analyze one frame at a time using all of the requested cores; scaling
>> would then depend on how well the requested algorithm parallelizes. For GB
>> this is OK, but for PB it is quite limiting. The approach of
>> parallelizing over frames takes advantage of the embarrassingly parallel
>> nature of MM/PBSA calculations and is why you can get nearly ideal
>> scaling up to ca. nframes/2 processors.
>>
>>> Obviously, my thinking is flawed, as from what you are saying the memory
>>> requirements increase with the number of cores ...
>>>
>>> So, if I measure the memory usage for a single frame on a single core, can I
>>> actually calculate how much memory I need for, let's say, 10000 frames on
>>> 128 cores?
>>>
>>> I will do some single-core, single-frame tests now ..
>> As I said above, the memory requirements depend on how many frames are
>> being analyzed concurrently---not how many frames are being analyzed
>> total. With 128 cores, you are analyzing 128 frames at once, so you
>> have to make sure you have enough memory for that. If each node has,
>> say, 32 GB of memory for 16 cores, you will need to ask for all 16
>> cores, but run no more than 4 threads (which will use all 32 GB of RAM)
>> on that node. [I would actually err on the side of caution and only run
>> 3 threads per node to allow up to 8 GB of overrun for each thread.]
>>
>> Many queuing systems also allow memory to be requested as a resource,
>> which means you can specify how much memory you want made available to
>> your job per processor. Other clusters may require you to use a full
>> node, so setting per-process memory limits wouldn't make as much sense.
>> This is where the cluster documentation helps significantly.
>>
>> Good luck,
>> Jason
>>
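
A minimal sketch of the arithmetic described in the quoted explanation above
(illustrative Python only, not code from MMPBSA.py.MPI; the chunking helper
and the 32 GB node / 7.5 GB-per-frame figures are hypothetical placeholders):

--------- illustrative memory-budget sketch ---------
#!/usr/bin/env python
# Illustrative only: mimics the idea that N MPI threads each get a
# near-equal chunk of frames and run concurrently, so peak memory is
# roughly (threads per node) x (memory of one serial single-frame run).

def split_frames(nframes, nthreads):
    """Return (start, stop) frame ranges, one near-equal contiguous
    chunk per thread."""
    base, extra = divmod(nframes, nthreads)
    chunks, start = [], 0
    for rank in range(nthreads):
        size = base + (1 if rank < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks

def max_threads_per_node(node_mem_gb, per_frame_mem_gb, safety=1.25):
    """Largest number of concurrent threads whose padded per-frame
    memory still fits on one node."""
    return max(1, int(node_mem_gb // (per_frame_mem_gb * safety)))

if __name__ == '__main__':
    # Hypothetical 16-core node with 32 GB RAM; a single-frame serial
    # run showed roughly 7.5 GB in top.
    print(split_frames(10000, 128)[:2])          # [(0, 79), (79, 158)]
    print(max_threads_per_node(32.0, 7.5, 1.0))  # 4 threads fill the node
    print(max_threads_per_node(32.0, 7.5))       # 3 with a 25% safety margin
-----------------------------------------------------

Whether VIRT or RES is the right per-frame figure to plug in is exactly the
question raised earlier in the thread; the arithmetic itself stays the same.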

-- 
Dr. Vlad Cojocaru
Max Planck Institute for Molecular Biomedicine
Department of Cell and Developmental Biology
Röntgenstrasse 20, 48149 Münster, Germany
Tel: +49-251-70365-324; Fax: +49-251-70365-399
Email: vlad.cojocaru[at]mpi-muenster.mpg.de
http://www.mpi-muenster.mpg.de/research/teams/groups/rgcojocaru
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Nov 14 2013 - 07:00:03 PST