Re: [AMBER] Amber16 pmemd.cuda.MPI simulation can drop a frame with an inopportune node crash

From: David Cerutti <dscerutti.gmail.com>
Date: Sat, 19 May 2018 07:30:27 -0400

Thanks for reporting this. We can take a look at fixing the behavior in
Amber18, but if triggering it requires the system to "crash" at a
particular point in the run, I'm not sure we can bulletproof against it.
The .info, .out, and .rst files are written in a fixed chronological
order. What you may very well end up with is a restart file from well
before the final frame of the trajectory, which is then used to start the
subsequent segment, repeating much more than a single frame. In general, I
think that if ls -l reveals that the binary trajectories from different
segments have different file sizes, there is a problem, and the simulation
should be re-run after erasing everything back through the first file that
deviated. I'm not sure how to write code that would prevent errors like
the one you are seeing, except to do something like mdgx, which allowed
the user to specify multiple segments of one gigantic trajectory and would
then scan any previously written segments upon startup, fast-forwarding to
the first incomplete segment. That system introduced a lot of complexity
that I ultimately decided was not worth repeating in future projects, and
even when one tries to make use of it, it is no panacea for
hardware-related problems.
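
As a rough illustration of the file-size check described above, here is a
minimal Python sketch. It assumes every segment was run with the same atom
count, the same number of saved frames, and the same binary (NetCDF) format,
so complete trajectories should all have the same size in bytes. The function
name, and the choice to treat the most common size as the "expected" one, are
assumptions made for illustration and are not part of Amber.

```python
import os
from collections import Counter


def find_first_deviant(paths):
    """Return the first path whose size differs from the most common size.

    `paths` should list the trajectory segments in chronological order.
    Returns None if the list is empty or all sizes agree.
    """
    sizes = [os.path.getsize(p) for p in paths]
    if not sizes:
        return None
    # Assumption: the size shared by most segments is the size of a complete one.
    expected, _ = Counter(sizes).most_common(1)[0]
    for path, size in zip(paths, sizes):
        if size != expected:
            return path
    return None
```

Pass the segment files in run order (for example a sorted glob of *.mdcrd, if
the names happen to sort chronologically). A return value of None means no
size mismatch was found, which does not by itself prove the run is intact.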

Dave


On Sat, May 19, 2018 at 3:07 AM, Chris Neale <candrewn.gmail.com> wrote:

> Hello,
>
> I am reporting very rare behavior of amber16 (pmemd.cuda.MPI) in which a
> single frame of the trajectory can be lost when there is a crash after the
> .rst file is written but before the .mdcrd file is completely written
> (though it's possible that I misunderstand what is happening). I've only
> ever seen this once in a couple of years of many runs.
>
> 1) The .out file from simulation segment A lists the last timestep as N
> multiples of the save frequency
> 2) The next simulation segment, B, from the previous .rst file (generated
> by A) starts at N+1 multiples of the save frequency
> 3) The .info file from simulation A only lists N-1 multiples of the save
> frequency
> 4) The .mdcrd file from run A only contains N-1 multiples of the save
> frequency
>
> Therefore, I lost a single frame.
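
The bookkeeping in points 1) to 4) can be checked mechanically. The sketch
below is a hypothetical Python helper, not part of Amber or cpptraj. It pulls
the largest NSTEP value out of the text of an .out file, divides it by the
save interval, and compares the result with a frame count obtained separately
(for example from cpptraj's "Read N frames" line). It assumes that energies
are printed at the same interval at which frames are saved and that NSTEP
counts from the start of the segment, as appears to be the case in the output
quoted in this message.

```python
import re

# Matches records such as "NSTEP = 1250000 TIME(PS) = ..." in .out text.
NSTEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")


def last_reported_step(out_text):
    """Return the largest NSTEP value found in the text, or 0 if none."""
    steps = [int(s) for s in NSTEP_RE.findall(out_text)]
    return max(steps) if steps else 0


def missing_frames(out_text, save_interval, frames_in_traj):
    """Return how many frames the trajectory lacks relative to the .out file.

    Assumption: one frame is saved every `save_interval` steps, starting
    from step `save_interval` of the segment.
    """
    expected = last_reported_step(out_text) // save_interval
    return expected - frames_in_traj
```

With the numbers reported for segment A (last NSTEP of 1250000, an interval
of 250000 steps, and 4 frames read by cpptraj), the helper returns 1.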
>
> It's obviously not a big deal, but I thought it was worth reporting.
>
> ### Here is the entire frame info in the .out file from simulation A
> (something obviously happened to the node early in the run, as there are
> only 5 frames):
>
> NSTEP = 500000 TIME(PS) = 252000.000 TEMP(K) = 310.55 PRESS = 0.0
> Etot = -1443981.0250 EKtot = 511461.3750 EPtot = -1955442.4000
> BOND = 31063.5846 ANGLE = 104624.2155 DIHED = 94789.5943
> UB = 38637.6267 IMP = 1622.4678 CMAP = -480.2490
> 1-4 NB = 16734.8086 1-4 EEL = 22438.6683 VDWAALS = 110161.7064
> EELEC = -2375034.8232 EHBOND = 0.0000 RESTRAINT = 0.0000
> EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 7707174.2890
> SURFTEN = 0.0000
> Density = 1.0159
> ------------------------------------------------------------------------------
>
>
> NSTEP = 750000 TIME(PS) = 253000.000 TEMP(K) = 310.39 PRESS = 0.0
> Etot = -1443585.7144 EKtot = 511200.8125 EPtot = -1954786.5269
> BOND = 30997.3434 ANGLE = 104278.1909 DIHED = 94840.9091
> UB = 38883.1231 IMP = 1552.7852 CMAP = -480.8654
> 1-4 NB = 16700.8642 1-4 EEL = 22547.0194 VDWAALS = 109796.3894
> EELEC = -2373902.2861 EHBOND = 0.0000 RESTRAINT = 0.0000
> EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 7717052.7828
> SURFTEN = 0.0000
> Density = 1.0146
> ------------------------------------------------------------------------------
>
>
> NSTEP = 1000000 TIME(PS) = 254000.000 TEMP(K) = 310.33 PRESS = 0.0
> Etot = -1444643.3525 EKtot = 511106.4375 EPtot = -1955749.7900
> BOND = 30687.3571 ANGLE = 104125.9164 DIHED = 94639.0638
> UB = 38576.6598 IMP = 1583.3243 CMAP = -529.8053
> 1-4 NB = 16717.5723 1-4 EEL = 23123.6355 VDWAALS = 110061.3064
> EELEC = -2374734.8203 EHBOND = 0.0000 RESTRAINT = 0.0000
> EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 7714688.7777
> SURFTEN = 0.0000
> Density = 1.0149
> ------------------------------------------------------------------------------
>
>
> NSTEP = 1250000 TIME(PS) = 255000.000 TEMP(K) = 309.93 PRESS = 0.0
> Etot = -1444755.4813 EKtot = 510449.4688 EPtot = -1955204.9500
> BOND = 30773.5521 ANGLE = 104686.9685 DIHED = 95135.0864
> UB = 38725.3307 IMP = 1579.5279 CMAP = -506.7794
> 1-4 NB = 16752.8571 1-4 EEL = 22309.6579 VDWAALS = 108843.7175
> EELEC = -2373504.8687 EHBOND = 0.0000 RESTRAINT = 0.0000
> EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 7709825.3119
> SURFTEN = 0.0000
> Density = 1.0156
> ------------------------------------------------------------------------------
>
>
> NSTEP = 1250000 TIME(PS) = 255000.000 TEMP(K) = 309.93 PRESS = 0.0
> Etot = -1444755.4813 EKtot = 510449.4688 EPtot = -1955204.9500
> BOND = 30773.5521 ANGLE = 104686.9685 DIHED = 95135.0864
> UB = 38725.3307 IMP = 1579.5279 CMAP = -506.7794
> 1-4 NB = 16752.8571 1-4 EEL = 22309.6579 VDWAALS = 108843.7175
> EELEC = -2373504.8687 EHBOND = 0.0000 RESTRAINT = 0.0000
> EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 7709825.3119
> SURFTEN = 0.0000
> Density = 1.0156
>
> ### Here is the first output of a frame from the .out file from simulation
> B:
>
> NSTEP = 250000 TIME(PS) = 256000.000 TEMP(K) = 310.53 PRESS = 0.0
> Etot = -1445111.8674 EKtot = 511427.6562 EPtot = -1956539.5237
> BOND = 30987.3060 ANGLE = 104014.2392 DIHED = 94880.5768
> UB = 38713.1815 IMP = 1588.8147 CMAP = -528.8742
> 1-4 NB = 16677.7195 1-4 EEL = 22418.8460 VDWAALS = 110184.1128
> EELEC = -2375475.4458 EHBOND = 0.0000 RESTRAINT = 0.0000
> EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 7706278.4149
> SURFTEN = 0.0000
> Density = 1.0160
>
> ### Here is the .info file from simulation A:
>
> NSTEP = 1000000 TIME(PS) = 254000.000 TEMP(K) = 310.33 PRESS = 0.0
> Etot = -1444643.3525 EKtot = 511106.4375 EPtot = -1955749.7900
> BOND = 30687.3571 ANGLE = 104125.9164 DIHED = 94639.0638
> UB = 38576.6598 IMP = 1583.3243 CMAP = -529.8053
> 1-4 NB = 16717.5723 1-4 EEL = 23123.6355 VDWAALS = 110061.3064
> EELEC = -2374734.8203 EHBOND = 0.0000 RESTRAINT = 0.0000
> EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 7714688.7777
> SURFTEN = 0.0000
> Density = 1.0149
>
>
> ### Here is output showing that the .mdcrd has only 4 frames (it should
> have 5):
>
> bash-4.2$ cpptraj -i cpptraj.inp
>
> CPPTRAJ: Trajectory Analysis. V16.16
> ___ ___ ___ ___
> | \/ | \/ | \/ |
> _|_/\_|_/\_|_/\_|_
>
> | Date/time: 05/19/18 01:03:08
> | Available memory: 79.699 GB
>
> INPUT: Reading input from 'cpptraj.inp'
> [parm bot240520ps/this.prmtop]
> Reading 'bot240520ps/this.prmtop' as Amber Topology
> CHAMBER topology: 1: CHARMM force field:
> No FF information parsed...
> [trajin bot240520ps/vbot240520ps_11.mdcrd]
> Reading 'bot240520ps/vbot240520ps_11.mdcrd' as Amber NetCDF
> [list trajin]
>
> INPUT TRAJECTORIES (1 total):
> 0: 'vbot240520ps_11.mdcrd' is a NetCDF AMBER trajectory, Parm this.prmtop (Orthogonal box) (reading 4 of 4)
> Coordinate processing will occur on 4 frames.
> [run]
> ---------- RUN BEGIN -------------------------------------------------
>
> PARAMETER FILES (1 total):
> 0: this.prmtop, 783396 atoms, 190056 res, box: Orthogonal, 189320 mol, 186216 solvent
>
> INPUT TRAJECTORIES (1 total):
> 0: 'vbot240520ps_11.mdcrd' is a NetCDF AMBER trajectory, Parm this.prmtop (Orthogonal box) (reading 4 of 4)
> Coordinate processing will occur on 4 frames.
>
> BEGIN TRAJECTORY PROCESSING:
> ----- vbot240520ps_11.mdcrd (1-4, 1) -----
> 0% 33% 67% 100% Complete.
>
> Read 4 frames and processed 4 frames.
> TIME: Avg. throughput= 74.0069 frames / second.
>
> ACTION OUTPUT:
>
> RUN TIMING:
> TIME: Init : 0.0002 s ( 0.41%)
> TIME: Trajectory Process : 0.0540 s ( 99.09%)
> TIME: Action Post : 0.0000 s ( 0.00%)
> TIME: Analysis : 0.0000 s ( 0.00%)
> TIME: Data File Write : 0.0000 s ( 0.00%)
> TIME: Other : 0.0003 s ( 0.00%)
> TIME: Run Total 0.0545 s
> ---------- RUN END ---------------------------------------------------
> TIME: Total execution time: 20.4604 seconds.
> --------------------------------------------------------------------------------
> To cite CPPTRAJ use:
> Daniel R. Roe and Thomas E. Cheatham, III, "PTRAJ and CPPTRAJ: Software for
> Processing and Analysis of Molecular Dynamics Trajectory Data". J. Chem.
> Theory Comput., 2013, 9 (7), pp 3084-3095.
>
>
> Thank you,
> Chris.
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>
Received on Sat May 19 2018 - 05:00:03 PDT