Re: [AMBER] Extract velocity from the restart file

From: Yin, Guowei <guowei_yin.med.unc.edu>
Date: Mon, 5 May 2014 13:39:17 +0000

Hi Jason,

Thanks for your suggestions. By the way, sometimes only lines like the ones below appear in the log file, without any vlimit message:

pmemd.MPI 000000000041E2D6 Unknown Unknown Unknown
libc.so.6 00000030FD01D994 Unknown Unknown Unknown
pmemd.MPI 000000000041E1D9 Unknown Unknown Unknown
--------------------------------------------------------------------------
orterun has exited due to process rank 0 with PID 1984 on
node n-5-2 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by orterun (as reported here).

What would this mean? The run just stopped without any clear indication. Thank you.

Best,
Guowei

-----Original Message-----
From: Jason Swails [mailto:jason.swails.gmail.com]
Sent: Wednesday, April 30, 2014 11:33 AM
To: amber.ambermd.org
Subject: Re: [AMBER] Extract velocity from the restart file

On Wed, 2014-04-30 at 15:14 +0000, Yin, Guowei wrote:
> Hi David,
>
>
>
> Thank you for the reply. As you said, the vlimit problem itself would not stop a run; there must be some other reason. I have copied the relevant part of the *.log file below. Could you help me diagnose it?
>
>
>
> ==================================
> vlimit exceeded for step ******; vmax = 29.9206
> vlimit exceeded for step ******; vmax = 21.0694
> vlimit exceeded for step ******; vmax = 26.8572
> vlimit exceeded for step ******; vmax = 46.3269
> vlimit exceeded for step ******; vmax = 27.0013
> vlimit exceeded for step ******; vmax = 45.7748
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 36 in communicator MPI_COMM_WORLD
> with errorcode 1.
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> orterun has exited due to process rank 36 with PID 3004 on
> node n-2-12 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by orterun (as reported here).
>
> forrtl: error (78): process killed (SIGTERM)
> Image              PC                Routine  Line     Source
> libintlc.so.5      00002AD753ADC1F9  Unknown  Unknown  Unknown
> libintlc.so.5      00002AD753ADAB70  Unknown  Unknown  Unknown
> libifcoremt.so.5   00002AD75291D5EF  Unknown  Unknown  Unknown
> libifcoremt.so.5   00002AD7528819B9  Unknown  Unknown  Unknown
> libifcoremt.so.5   00002AD75289350E  Unknown  Unknown  Unknown
> libpthread.so.0    0000003F8740EB10  Unknown  Unknown  Unknown
> libpthread.so.0    0000003F8740B725  Unknown  Unknown  Unknown
> libmlx4-rdmav2.so  00002AAAAAABDECC  Unknown  Unknown  Unknown
> mca_btl_openib.so  00002AD7569EFCAF  Unknown  Unknown  Unknown
> libopen-pal.so.0   00002AD75238A394  Unknown  Unknown  Unknown
> libmpi.so.0        00002AD751E7E1B1  Unknown  Unknown  Unknown
> libmpi.so.0        00002AD751EAF492  Unknown  Unknown  Unknown
> libmpi_f77.so.0    00002AD751C39839  Unknown  Unknown  Unknown
> pmemd.MPI          00000000005AAFA1  Unknown  Unknown  Unknown
> pmemd.MPI          00000000005A8D8A  Unknown  Unknown  Unknown
> pmemd.MPI          00000000005A6E9F  Unknown  Unknown  Unknown
> pmemd.MPI          0000000000573E28  Unknown  Unknown  Unknown
> pmemd.MPI          0000000000561DFD  Unknown  Unknown  Unknown
> pmemd.MPI          00000000004E98C2  Unknown  Unknown  Unknown
> pmemd.MPI          000000000041E2D6  Unknown  Unknown  Unknown
> libc.so.6          0000003F86C1D994  Unknown  Unknown  Unknown
> pmemd.MPI          000000000041E1D9  Unknown  Unknown  Unknown
>
> Apr 30 00:41:22 2014 6470 4 7.06 checkPAMRESActionTab: action 31 RES_KILL_TASKS 9 to host <n-11-13> timed out after 60 seconds
> Apr 30 00:47:25 2014 6470 3 7.06 PAM: waitForPJLExit: Timed out while waiting for PJL to exit. Sending SIGKILL
>
> TID    HOST_NAME  COMMAND_LINE      STATUS                   TERMINATION_TIME
> =====  =========  ================  =======================  ===================
> 00000  n-3-3      pmemd.MPI -p ./G  Killed by PAM (SIGKILL)  04/30/2014 00:40:22
> 00001  n-11-12    pmemd.MPI -p ./G  Killed by PAM (SIGKILL)  04/30/2014 00:40:22
> 00002  n-11-12    pmemd.MPI -p ./G  Killed by PAM (SIGKILL)  04/30/2014 00:40:25
> 00003  n-11-12    pmemd.MPI -p ./G  Signaled (SIGPIPE)       04/30/2014 00:40:25
> ... ...
> 00062  n-4-8      pmemd.MPI -p ./G  Killed by PAM (SIGKILL)  04/30/2014 00:40:25
> 00063  n-2-16     pmemd.MPI -p ./G  Exit (1)                 04/30/2014 00:40:25
> ====================
>
>
> So, is the problem caused by the MPI job or by vlimit? I tried a smaller
> time step, but that still didn't cure it. Thank you.

It looks like the simulation blew up. I suspect a SHAKE failure occurred on one of the non-master processors, but the repeated vlimit warnings (and the fact that the maximum velocities are increasing) suggest a blowup of some sort.

I suspect a couple of atoms are getting a little too close. There are several reasons this could happen -- the pressure coupling is too strong, so the box shrinks too much in a single step; the time step is too long; or there is some kind of parameter problem in which there is not enough vdW repulsion between a particular pair of atoms.
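If it helps, here is a minimal &cntrl sketch of the kind of conservative settings people often try while chasing this sort of blowup (shorter time step, slower pressure relaxation, frequent output). The file and the specific values are only illustrative assumptions on my part, not your actual input:

 &cntrl
   imin=0, irest=1, ntx=5,            ! restart from coordinates + velocities
   nstlim=50000, dt=0.001,            ! 1 fs time step while diagnosing
   ntc=2, ntf=2,                      ! SHAKE on bonds involving hydrogen
   ntt=3, gamma_ln=2.0, temp0=300.0,  ! Langevin thermostat
   ntp=1, taup=5.0,                   ! slower (gentler) pressure relaxation
   ntpr=50, ntwx=50, ntwr=1000,       ! write energies/coordinates often
 /

If the run survives with settings like these but dies with your original ones, that points at the time step or the barostat rather than a bad parameter.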

This will require some investigation on your part -- try printing out all of the snapshots around the point at which the blowup occurs and see if visualizing the process yields any insight.
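For example (a rough cpptraj sketch; the topology/trajectory names and the frame window below are placeholders, not taken from your run):

 # write the frames surrounding the blowup as individual PDB files
 parm prmtop
 trajin prod.nc 4900 5100
 trajout blowup.pdb pdb multi
 run

Running that with something like "cpptraj -i check_blowup.in" gives one PDB per frame, which you can step through in VMD or Chimera to see which atoms go bad first.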

HTH,
Jason

--
Jason M. Swails
BioMaPS,
Rutgers University
Postdoctoral Researcher
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon May 05 2014 - 07:00:06 PDT