Re: [AMBER] Extract velocity from the restart file

From: Jason Swails <jason.swails.gmail.com>
Date: Wed, 30 Apr 2014 11:32:54 -0400

On Wed, 2014-04-30 at 15:14 +0000, Yin, Guowei wrote:
> Hi David,
>
> Thank you for the reply. As you said, the vlimit problem itself would not stop a run, so something else must be going on. I have copied the relevant part of the *.log file below; could you help me diagnose it?
>
> ==================================
>
> vlimit exceeded for step ******; vmax = 29.9206
> vlimit exceeded for step ******; vmax = 21.0694
> vlimit exceeded for step ******; vmax = 26.8572
> vlimit exceeded for step ******; vmax = 46.3269
> vlimit exceeded for step ******; vmax = 27.0013
> vlimit exceeded for step ******; vmax = 45.7748
>
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 36 in communicator MPI_COMM_WORLD
> with errorcode 1.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
>
> --------------------------------------------------------------------------
> orterun has exited due to process rank 36 with PID 3004 on
> node n-2-12 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by orterun (as reported here).
>
> forrtl: error (78): process killed (SIGTERM)
> Image              PC                Routine  Line     Source
> libintlc.so.5      00002AD753ADC1F9  Unknown  Unknown  Unknown
> libintlc.so.5      00002AD753ADAB70  Unknown  Unknown  Unknown
> libifcoremt.so.5   00002AD75291D5EF  Unknown  Unknown  Unknown
> libifcoremt.so.5   00002AD7528819B9  Unknown  Unknown  Unknown
> libifcoremt.so.5   00002AD75289350E  Unknown  Unknown  Unknown
> libpthread.so.0    0000003F8740EB10  Unknown  Unknown  Unknown
> libpthread.so.0    0000003F8740B725  Unknown  Unknown  Unknown
> libmlx4-rdmav2.so  00002AAAAAABDECC  Unknown  Unknown  Unknown
> mca_btl_openib.so  00002AD7569EFCAF  Unknown  Unknown  Unknown
> libopen-pal.so.0   00002AD75238A394  Unknown  Unknown  Unknown
> libmpi.so.0        00002AD751E7E1B1  Unknown  Unknown  Unknown
> libmpi.so.0        00002AD751EAF492  Unknown  Unknown  Unknown
> libmpi_f77.so.0    00002AD751C39839  Unknown  Unknown  Unknown
> pmemd.MPI          00000000005AAFA1  Unknown  Unknown  Unknown
> pmemd.MPI          00000000005A8D8A  Unknown  Unknown  Unknown
> pmemd.MPI          00000000005A6E9F  Unknown  Unknown  Unknown
> pmemd.MPI          0000000000573E28  Unknown  Unknown  Unknown
> pmemd.MPI          0000000000561DFD  Unknown  Unknown  Unknown
> pmemd.MPI          00000000004E98C2  Unknown  Unknown  Unknown
> pmemd.MPI          000000000041E2D6  Unknown  Unknown  Unknown
> libc.so.6          0000003F86C1D994  Unknown  Unknown  Unknown
> pmemd.MPI          000000000041E1D9  Unknown  Unknown  Unknown
>
> Apr 30 00:41:22 2014 6470 4 7.06 checkPAMRESActionTab: action 31 RES_KILL_TASKS 9 to host <n-11-13> timed out after 60 seconds
> Apr 30 00:47:25 2014 6470 3 7.06 PAM: waitForPJLExit: Timed out while waiting for PJL to exit. Sending SIGKILL
>
> TID    HOST_NAME  COMMAND_LINE      STATUS                   TERMINATION_TIME
> =====  =========  ================  =======================  ===================
> 00000  n-3-3      pmemd.MPI -p ./G  Killed by PAM (SIGKILL)  04/30/2014 00:40:22
> 00001  n-11-12    pmemd.MPI -p ./G  Killed by PAM (SIGKILL)  04/30/2014 00:40:22
> 00002  n-11-12    pmemd.MPI -p ./G  Killed by PAM (SIGKILL)  04/30/2014 00:40:25
> 00003  n-11-12    pmemd.MPI -p ./G  Signaled (SIGPIPE)       04/30/2014 00:40:25
> ... ...
> 00062  n-4-8      pmemd.MPI -p ./G  Killed by PAM (SIGKILL)  04/30/2014 00:40:25
> 00063  n-2-16     pmemd.MPI -p ./G  Exit (1)                 04/30/2014 00:40:25
>
> ====================
>
> So, is the problem caused by the MPI job or by vlimit? I tried a smaller
> time step, but it still did not cure the problem. Thank you.

It looks like the simulation blew up. I suspect a SHAKE failure
occurred on one of the non-master processors, but the repeated vlimit
warnings (and the fact that the maximum velocities are increasing)
suggest a blowup of some sort.

I suspect a couple of atoms are getting a little too close. There are
several reasons this could happen -- the pressure coupling may be too
strong, so the box shrinks too much in one step; the time step may be
too long; or there may be some kind of parameter problem in which there
is not enough vdW repulsion between a particular pair of atoms.
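If it helps narrow things down, below is the kind of more conservative
&cntrl block I would use to test the first two possibilities. Treat it
purely as a sketch -- the values (dt, taup, thermostat settings, output
frequencies) are illustrative and need to be adapted to your own
protocol:

diagnostic NPT run around the crash
&cntrl
  imin=0, irest=1, ntx=5,            ! continue from a restart with velocities
  nstlim=50000, dt=0.001,            ! 1 fs step while diagnosing
  ntc=2, ntf=2,                      ! SHAKE on bonds involving hydrogen
  ntb=2, ntp=1, taup=5.0,            ! gentler pressure coupling (default taup is 1.0 ps)
  ntt=3, gamma_ln=2.0, temp0=300.0,  ! Langevin thermostat
  cut=8.0,
  ntpr=50, ntwx=50, ntwr=500,        ! frequent output so the blowup is caught on disk
/

If the run survives with the shorter time step and weaker pressure
coupling but still crashes with your original settings, that points at
the integrator/barostat side rather than at a missing vdW repulsion.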

This will require some investigation on your part -- try printing out
all of the snapshots around the point at which the blowup occurs and
see whether visualizing the process yields any insight.
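As a sketch of what I mean, a cpptraj input along these lines would
carve out a window of frames around the crash so you can load it
(together with the prmtop) into VMD or a similar viewer. The file names
and frame numbers here are placeholders -- work out the real frame range
from your ntwx setting and from where the vlimit messages start
appearing in mdout:

parm complex.prmtop              # topology (placeholder name)
trajin prod.nc 4900 5100 1       # frames bracketing the crash: start stop offset (placeholders)
trajout blowup_window.nc netcdf  # small chunk to load alongside the prmtop
go

If the crash is reproducible, restarting from the last good restart
file with ntwx=1 will also give you a coordinate frame at every step
leading into the failure.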

HTH,
Jason

-- 
Jason M. Swails
BioMaPS,
Rutgers University
Postdoctoral Researcher
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Apr 30 2014 - 09:00:02 PDT