[AMBER] Amber16 pmemd.cuda.MPI simulation can drop a frame with an inopportune node crash

From: Chris Neale <candrewn.gmail.com>
Date: Sat, 19 May 2018 01:07:06 -0600

Hello,

I am reporting a very rare behavior of Amber16 (pmemd.cuda.MPI) in which a
single frame of the trajectory can be lost when a crash occurs after the
.rst file is written but before the .mdcrd file is completely written
(though it's possible that I misunderstand what is happening). I've seen
this only once in a couple of years of many runs.

1) The .out file from simulation segment A lists the last timestep as N
multiples of the save frequency.
2) The next simulation segment, B, started from the .rst file generated by
A, begins at N+1 multiples of the save frequency.
3) The .info file from simulation A lists only N-1 multiples of the save
frequency.
4) The .mdcrd file from run A contains only N-1 multiples of the save
frequency.

Therefore, I lost a single frame.
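
The arithmetic behind that conclusion can be sketched in a few lines of
Python. This is only an illustration: the function names are hypothetical,
and it assumes the trajectory is written at the same interval as the energy
records (250000 steps, judging by the NSTEP spacing in the output further
down in this message).

```python
def expected_frames(last_nstep, save_every):
    """Frames that should be in the trajectory after last_nstep steps."""
    if save_every <= 0:
        raise ValueError("save_every must be positive")
    return last_nstep // save_every


def missing_frames(last_nstep, save_every, frames_in_traj):
    """Expected frame count minus the count actually read back."""
    return expected_frames(last_nstep, save_every) - frames_in_traj


# Values from this report: the last NSTEP in A's .out file is 1250000,
# the assumed write interval is 250000 steps, and cpptraj reads 4 frames.
print(expected_frames(1250000, 250000))    # 5
print(missing_frames(1250000, 250000, 4))  # 1
```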

It's obviously not a big deal, but I thought it was worth reporting.

### Here is the entire frame info in the .out file from simulation A
(something obviously happened to the node early in the run, as there are
only 5 frames):

 NSTEP = 500000 TIME(PS) = 252000.000 TEMP(K) = 310.55 PRESS = 0.0
 Etot = -1443981.0250 EKtot = 511461.3750 EPtot = -1955442.4000
 BOND = 31063.5846 ANGLE = 104624.2155 DIHED = 94789.5943
 UB = 38637.6267 IMP = 1622.4678 CMAP = -480.2490
 1-4 NB = 16734.8086 1-4 EEL = 22438.6683 VDWAALS = 110161.7064
 EELEC = -2375034.8232 EHBOND = 0.0000 RESTRAINT = 0.0000
 EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 7707174.2890
                                                    SURFTEN = 0.0000
                                                    Density = 1.0159
 ------------------------------------------------------------------------------


 NSTEP = 750000 TIME(PS) = 253000.000 TEMP(K) = 310.39 PRESS = 0.0
 Etot = -1443585.7144 EKtot = 511200.8125 EPtot = -1954786.5269
 BOND = 30997.3434 ANGLE = 104278.1909 DIHED = 94840.9091
 UB = 38883.1231 IMP = 1552.7852 CMAP = -480.8654
 1-4 NB = 16700.8642 1-4 EEL = 22547.0194 VDWAALS = 109796.3894
 EELEC = -2373902.2861 EHBOND = 0.0000 RESTRAINT = 0.0000
 EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 7717052.7828
                                                    SURFTEN = 0.0000
                                                    Density = 1.0146
 ------------------------------------------------------------------------------


 NSTEP = 1000000 TIME(PS) = 254000.000 TEMP(K) = 310.33 PRESS = 0.0
 Etot = -1444643.3525 EKtot = 511106.4375 EPtot = -1955749.7900
 BOND = 30687.3571 ANGLE = 104125.9164 DIHED = 94639.0638
 UB = 38576.6598 IMP = 1583.3243 CMAP = -529.8053
 1-4 NB = 16717.5723 1-4 EEL = 23123.6355 VDWAALS = 110061.3064
 EELEC = -2374734.8203 EHBOND = 0.0000 RESTRAINT = 0.0000
 EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 7714688.7777
                                                    SURFTEN = 0.0000
                                                    Density = 1.0149
 ------------------------------------------------------------------------------


 NSTEP = 1250000 TIME(PS) = 255000.000 TEMP(K) = 309.93 PRESS = 0.0
 Etot = -1444755.4813 EKtot = 510449.4688 EPtot = -1955204.9500
 BOND = 30773.5521 ANGLE = 104686.9685 DIHED = 95135.0864
 UB = 38725.3307 IMP = 1579.5279 CMAP = -506.7794
 1-4 NB = 16752.8571 1-4 EEL = 22309.6579 VDWAALS = 108843.7175
 EELEC = -2373504.8687 EHBOND = 0.0000 RESTRAINT = 0.0000
 EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 7709825.3119
                                                    SURFTEN = 0.0000
                                                    Density = 1.0156
 ------------------------------------------------------------------------------


 NSTEP = 1250000 TIME(PS) = 255000.000 TEMP(K) = 309.93 PRESS = 0.0
 Etot = -1444755.4813 EKtot = 510449.4688 EPtot = -1955204.9500
 BOND = 30773.5521 ANGLE = 104686.9685 DIHED = 95135.0864
 UB = 38725.3307 IMP = 1579.5279 CMAP = -506.7794
 1-4 NB = 16752.8571 1-4 EEL = 22309.6579 VDWAALS = 108843.7175
 EELEC = -2373504.8687 EHBOND = 0.0000 RESTRAINT = 0.0000
 EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 7709825.3119
                                                    SURFTEN = 0.0000
                                                    Density = 1.0156

### Here is the first frame output in the .out file from simulation B:

 NSTEP = 250000 TIME(PS) = 256000.000 TEMP(K) = 310.53 PRESS = 0.0
 Etot = -1445111.8674 EKtot = 511427.6562 EPtot = -1956539.5237
 BOND = 30987.3060 ANGLE = 104014.2392 DIHED = 94880.5768
 UB = 38713.1815 IMP = 1588.8147 CMAP = -528.8742
 1-4 NB = 16677.7195 1-4 EEL = 22418.8460 VDWAALS = 110184.1128
 EELEC = -2375475.4458 EHBOND = 0.0000 RESTRAINT = 0.0000
 EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 7706278.4149
                                                    SURFTEN = 0.0000
                                                    Density = 1.0160
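
For anyone who wants to check their own segments for the same thing, here
is a rough Python sketch that pulls the NSTEP / TIME(PS) pairs out of
mdout-style text and reports the time gap between the end of one segment
and the start of the next. The helper names (step_times, gap_between) are
made up for illustration, and the sample strings are abbreviated copies of
the records quoted in this message, not complete .out files.

```python
import re

# Matches the "NSTEP = ... TIME(PS) = ..." part of an energy record.
RECORD = re.compile(r"NSTEP\s*=\s*(\d+)\s+TIME\(PS\)\s*=\s*([0-9.]+)")


def step_times(text):
    """Return (NSTEP, TIME(PS)) pairs in order, dropping consecutive repeats."""
    pairs = []
    for match in RECORD.finditer(text):
        pair = (int(match.group(1)), float(match.group(2)))
        if not pairs or pairs[-1] != pair:
            pairs.append(pair)
    return pairs


def gap_between(out_a, out_b):
    """Time in ps from the last record of segment A to the first of B."""
    return step_times(out_b)[0][1] - step_times(out_a)[-1][1]


# Abbreviated records from segment A (the last one appears twice in its .out).
segment_a = """
 NSTEP = 1000000 TIME(PS) = 254000.000 TEMP(K) = 310.33 PRESS = 0.0
 NSTEP = 1250000 TIME(PS) = 255000.000 TEMP(K) = 309.93 PRESS = 0.0
 NSTEP = 1250000 TIME(PS) = 255000.000 TEMP(K) = 309.93 PRESS = 0.0
"""
# Abbreviated first record from segment B.
segment_b = " NSTEP = 250000 TIME(PS) = 256000.000 TEMP(K) = 310.53 PRESS = 0.0"

print(step_times(segment_a))
print(gap_between(segment_a, segment_b))
```

A gap equal to one save interval (1000 ps here) is consistent with B having
restarted from the state at 255000 ps, so the restart itself is not what
went missing; only the trajectory frame for that time is.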

### Here is the .info file from simulation A:

 NSTEP = 1000000 TIME(PS) = 254000.000 TEMP(K) = 310.33 PRESS = 0.0
 Etot = -1444643.3525 EKtot = 511106.4375 EPtot = -1955749.7900
 BOND = 30687.3571 ANGLE = 104125.9164 DIHED = 94639.0638
 UB = 38576.6598 IMP = 1583.3243 CMAP = -529.8053
 1-4 NB = 16717.5723 1-4 EEL = 23123.6355 VDWAALS = 110061.3064
 EELEC = -2374734.8203 EHBOND = 0.0000 RESTRAINT = 0.0000
 EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 7714688.7777
                                                    SURFTEN = 0.0000
                                                    Density = 1.0149


### Here is output showing that the .mdcrd has only 4 frames (it should
have 5):

bash-4.2$ cpptraj -i cpptraj.inp

CPPTRAJ: Trajectory Analysis. V16.16
    ___ ___ ___ ___
     | \/ | \/ | \/ |
    _|_/\_|_/\_|_/\_|_

| Date/time: 05/19/18 01:03:08
| Available memory: 79.699 GB

INPUT: Reading input from 'cpptraj.inp'
  [parm bot240520ps/this.prmtop]
    Reading 'bot240520ps/this.prmtop' as Amber Topology
    CHAMBER topology: 1: CHARMM force field: No FF information parsed...
  [trajin bot240520ps/vbot240520ps_11.mdcrd]
    Reading 'bot240520ps/vbot240520ps_11.mdcrd' as Amber NetCDF
  [list trajin]

INPUT TRAJECTORIES (1 total):
 0: 'vbot240520ps_11.mdcrd' is a NetCDF AMBER trajectory, Parm this.prmtop (Orthogonal box) (reading 4 of 4)
  Coordinate processing will occur on 4 frames.
  [run]
---------- RUN BEGIN -------------------------------------------------

PARAMETER FILES (1 total):
 0: this.prmtop, 783396 atoms, 190056 res, box: Orthogonal, 189320 mol, 186216 solvent

INPUT TRAJECTORIES (1 total):
 0: 'vbot240520ps_11.mdcrd' is a NetCDF AMBER trajectory, Parm this.prmtop (Orthogonal box) (reading 4 of 4)
  Coordinate processing will occur on 4 frames.

BEGIN TRAJECTORY PROCESSING:
----- vbot240520ps_11.mdcrd (1-4, 1) -----
 0% 33% 67% 100% Complete.

Read 4 frames and processed 4 frames.
TIME: Avg. throughput= 74.0069 frames / second.

ACTION OUTPUT:

RUN TIMING:
TIME: Init : 0.0002 s ( 0.41%)
TIME: Trajectory Process : 0.0540 s ( 99.09%)
TIME: Action Post : 0.0000 s ( 0.00%)
TIME: Analysis : 0.0000 s ( 0.00%)
TIME: Data File Write : 0.0000 s ( 0.00%)
TIME: Other : 0.0003 s ( 0.00%)
TIME: Run Total 0.0545 s
---------- RUN END ---------------------------------------------------
TIME: Total execution time: 20.4604 seconds.
--------------------------------------------------------------------------------
To cite CPPTRAJ use:
Daniel R. Roe and Thomas E. Cheatham, III, "PTRAJ and CPPTRAJ: Software for
  Processing and Analysis of Molecular Dynamics Trajectory Data". J. Chem.
  Theory Comput., 2013, 9 (7), pp 3084-3095.


Thank you,
Chris.
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Sat May 19 2018 - 00:30:03 PDT