Re: [AMBER] Amber16 pmemd.cuda.MPI simulation can drop a frame with an inopportune node crash

From: Chris Neale <candrewn.gmail.com>
Date: Sat, 19 May 2018 16:04:04 -0600

Dear Dave:

You are right that I frequently end up with a restart file from well before
the final frame of the trajectory. In this case, though, I have verified
that it is the other way around. Your idea of optionally loading the
previous trajectory into pmemd, so that it can check for missing frames
relative to the restart, sounds great. Then again, I could always script
this check myself, so I have a usable solution going forward.
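
A minimal sketch of how such a check could be scripted. It assumes that the
trajectory write interval (ntwx) equals the energy print interval, and that
the frame count is obtained separately (for example from the "reading 4 of
4" line that cpptraj prints). The function names are hypothetical and not
part of Amber.

```python
import re

# Matches the step counter that starts each energy record in a .out file,
# e.g. "NSTEP = 1250000 TIME(PS) = 255000.000 ...".
NSTEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")


def last_nstep(mdout_text):
    """Return the largest NSTEP value found in the text, or 0 if none."""
    steps = [int(s) for s in NSTEP_RE.findall(mdout_text)]
    return max(steps) if steps else 0


def missing_frames(mdout_text, ntwx, frames_in_traj):
    """Frames implied by the last NSTEP minus frames actually in the file.

    ntwx is the trajectory write interval in steps (assumed here to equal
    the energy print interval); frames_in_traj is counted elsewhere.
    """
    return last_nstep(mdout_text) // ntwx - frames_in_traj
```

With the numbers from this report (last NSTEP of 1250000, records spaced
250000 steps apart, 4 frames in the .mdcrd), missing_frames(text, 250000, 4)
returns 1.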

Thank you for your reply,
Chris.

On Sat, May 19, 2018 at 5:30 AM, David Cerutti <dscerutti.gmail.com> wrote:

> Thanks for reporting this, we can take a look at fixing the behavior in
> amber18, but if this requires the system to "crash" at some point in the
> run I'm not sure we can bulletproof against it. There is a chronological
> order for writing the .info, .out, and .rst files. And, what you may very
> well have is a restart file from well before the final frame of the
> trajectory that is then used to start the subsequent segment, repeating
> much more than a single frame. In general, I think that if ls -l reveals
> that different binary trajectories are of different file sizes, there is a
> problem and the simulation should be re-run after erasing everything
> through the first file that deviated. I'm not sure how to write code that
> would prevent errors like the one you are seeing, except to do something
> like mdgx, which allowed the user to specify multiple segments of one
> gigantic trajectory and then scanned any previously written segments upon
> startup, fast-forwarding to the first incomplete segment. This system
> introduced a lot of complexity that I ultimately decided was not worth
> repeating in future projects, and even when one tries to make use of it,
> it is no panacea for hardware-related problems.
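
The file-size check described above could be automated along these lines.
This is only a sketch: it assumes every segment was set up to write the same
number of frames for the same system, so that complete binary trajectories
share one size. The function names are hypothetical.

```python
import os
from collections import Counter


def first_deviating(sizes):
    """Return the name of the first segment with an unusual file size.

    sizes is a list of (name, size_in_bytes) pairs in run order. The most
    common size is taken to be the size of a complete segment (an
    assumption). Returns None if all sizes agree or the list is empty.
    """
    if not sizes:
        return None
    common_size, _ = Counter(size for _, size in sizes).most_common(1)[0]
    for name, size in sizes:
        if size != common_size:
            return name
    return None


def check_segments(paths):
    """Apply first_deviating to trajectory files on disk, in run order."""
    return first_deviating([(p, os.path.getsize(p)) for p in paths])
```

Everything from the returned segment onward would then be the candidate for
erasing and re-running.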
>
> Dave
>
>
> On Sat, May 19, 2018 at 3:07 AM, Chris Neale <candrewn.gmail.com> wrote:
>
> > Hello,
> >
> > I am reporting very rare behavior of amber16 (pmemd.cuda.MPI) in which a
> > single frame of the trajectory can be lost when there is a crash after
> > the .rst file is written but before the .mdcrd file is completely
> > written (though it's possible that I misunderstand what is happening).
> > I've only ever seen this once in a couple of years of many runs.
> >
> > 1) The .out file from simulation segment A lists the last timestep as N
> > multiples of the save frequency
> > 2) The next simulation segment, B, started from the previous .rst file
> > (generated by A), starts at N+1 multiples of the save frequency
> > 3) The .info file from simulation A only lists N-1 multiples of the save
> > frequency
> > 4) The .mdcrd file from run A only contains N-1 multiples of the save
> > frequency
> >
> > Therefore, I lost a single frame.
> >
> > It's obviously not a big deal, but I thought it was worth reporting.
> >
> > ### Here is the entire frame info in the .out file from simulation A
> > (something obviously happened to the node early in the run, as there are
> > only 5 frames):
> >
> > NSTEP = 500000 TIME(PS) = 252000.000 TEMP(K) = 310.55 PRESS = 0.0
> > Etot = -1443981.0250 EKtot = 511461.3750 EPtot = -1955442.4000
> > BOND = 31063.5846 ANGLE = 104624.2155 DIHED = 94789.5943
> > UB = 38637.6267 IMP = 1622.4678 CMAP = -480.2490
> > 1-4 NB = 16734.8086 1-4 EEL = 22438.6683 VDWAALS = 110161.7064
> > EELEC = -2375034.8232 EHBOND = 0.0000 RESTRAINT = 0.0000
> > EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 7707174.2890
> > SURFTEN = 0.0000
> > Density = 1.0159
> > ------------------------------------------------------------------------------
> >
> >
> > NSTEP = 750000 TIME(PS) = 253000.000 TEMP(K) = 310.39 PRESS = 0.0
> > Etot = -1443585.7144 EKtot = 511200.8125 EPtot = -1954786.5269
> > BOND = 30997.3434 ANGLE = 104278.1909 DIHED = 94840.9091
> > UB = 38883.1231 IMP = 1552.7852 CMAP = -480.8654
> > 1-4 NB = 16700.8642 1-4 EEL = 22547.0194 VDWAALS = 109796.3894
> > EELEC = -2373902.2861 EHBOND = 0.0000 RESTRAINT = 0.0000
> > EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 7717052.7828
> > SURFTEN = 0.0000
> > Density = 1.0146
> > ------------------------------------------------------------------------------
> >
> >
> > NSTEP = 1000000 TIME(PS) = 254000.000 TEMP(K) = 310.33 PRESS = 0.0
> > Etot = -1444643.3525 EKtot = 511106.4375 EPtot = -1955749.7900
> > BOND = 30687.3571 ANGLE = 104125.9164 DIHED = 94639.0638
> > UB = 38576.6598 IMP = 1583.3243 CMAP = -529.8053
> > 1-4 NB = 16717.5723 1-4 EEL = 23123.6355 VDWAALS = 110061.3064
> > EELEC = -2374734.8203 EHBOND = 0.0000 RESTRAINT = 0.0000
> > EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 7714688.7777
> > SURFTEN = 0.0000
> > Density = 1.0149
> > ------------------------------------------------------------------------------
> >
> >
> > NSTEP = 1250000 TIME(PS) = 255000.000 TEMP(K) = 309.93 PRESS = 0.0
> > Etot = -1444755.4813 EKtot = 510449.4688 EPtot = -1955204.9500
> > BOND = 30773.5521 ANGLE = 104686.9685 DIHED = 95135.0864
> > UB = 38725.3307 IMP = 1579.5279 CMAP = -506.7794
> > 1-4 NB = 16752.8571 1-4 EEL = 22309.6579 VDWAALS = 108843.7175
> > EELEC = -2373504.8687 EHBOND = 0.0000 RESTRAINT = 0.0000
> > EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 7709825.3119
> > SURFTEN = 0.0000
> > Density = 1.0156
> > ------------------------------------------------------------------------------
> >
> >
> > NSTEP = 1250000 TIME(PS) = 255000.000 TEMP(K) = 309.93 PRESS = 0.0
> > Etot = -1444755.4813 EKtot = 510449.4688 EPtot = -1955204.9500
> > BOND = 30773.5521 ANGLE = 104686.9685 DIHED = 95135.0864
> > UB = 38725.3307 IMP = 1579.5279 CMAP = -506.7794
> > 1-4 NB = 16752.8571 1-4 EEL = 22309.6579 VDWAALS = 108843.7175
> > EELEC = -2373504.8687 EHBOND = 0.0000 RESTRAINT = 0.0000
> > EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 7709825.3119
> > SURFTEN = 0.0000
> > Density = 1.0156
> >
> > ### Here is the first frame output in the .out file from simulation B:
> >
> > NSTEP = 250000 TIME(PS) = 256000.000 TEMP(K) = 310.53 PRESS = 0.0
> > Etot = -1445111.8674 EKtot = 511427.6562 EPtot = -1956539.5237
> > BOND = 30987.3060 ANGLE = 104014.2392 DIHED = 94880.5768
> > UB = 38713.1815 IMP = 1588.8147 CMAP = -528.8742
> > 1-4 NB = 16677.7195 1-4 EEL = 22418.8460 VDWAALS = 110184.1128
> > EELEC = -2375475.4458 EHBOND = 0.0000 RESTRAINT = 0.0000
> > EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 7706278.4149
> > SURFTEN = 0.0000
> > Density = 1.0160
> >
> > ### Here is the .info file from simulation A:
> >
> > NSTEP = 1000000 TIME(PS) = 254000.000 TEMP(K) = 310.33 PRESS = 0.0
> > Etot = -1444643.3525 EKtot = 511106.4375 EPtot = -1955749.7900
> > BOND = 30687.3571 ANGLE = 104125.9164 DIHED = 94639.0638
> > UB = 38576.6598 IMP = 1583.3243 CMAP = -529.8053
> > 1-4 NB = 16717.5723 1-4 EEL = 23123.6355 VDWAALS = 110061.3064
> > EELEC = -2374734.8203 EHBOND = 0.0000 RESTRAINT = 0.0000
> > EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 7714688.7777
> > SURFTEN = 0.0000
> > Density = 1.0149
> >
> >
> > ### Here is output showing that the .mdcrd has only 4 frames (it should
> > have 5):
> >
> > bash-4.2$ cpptraj -i cpptraj.inp
> >
> > CPPTRAJ: Trajectory Analysis. V16.16
> > ___ ___ ___ ___
> > | \/ | \/ | \/ |
> > _|_/\_|_/\_|_/\_|_
> >
> > | Date/time: 05/19/18 01:03:08
> > | Available memory: 79.699 GB
> >
> > INPUT: Reading input from 'cpptraj.inp'
> > [parm bot240520ps/this.prmtop]
> > Reading 'bot240520ps/this.prmtop' as Amber Topology
> > CHAMBER topology: 1: CHARMM force field: No FF information parsed...
> > [trajin bot240520ps/vbot240520ps_11.mdcrd]
> > Reading 'bot240520ps/vbot240520ps_11.mdcrd' as Amber NetCDF
> > [list trajin]
> >
> > INPUT TRAJECTORIES (1 total):
> > 0: 'vbot240520ps_11.mdcrd' is a NetCDF AMBER trajectory, Parm this.prmtop (Orthogonal box) (reading 4 of 4)
> > Coordinate processing will occur on 4 frames.
> > [run]
> > ---------- RUN BEGIN -------------------------------------------------
> >
> > PARAMETER FILES (1 total):
> > 0: this.prmtop, 783396 atoms, 190056 res, box: Orthogonal, 189320 mol,
> > 186216 solvent
> >
> > INPUT TRAJECTORIES (1 total):
> > 0: 'vbot240520ps_11.mdcrd' is a NetCDF AMBER trajectory, Parm this.prmtop (Orthogonal box) (reading 4 of 4)
> > Coordinate processing will occur on 4 frames.
> >
> > BEGIN TRAJECTORY PROCESSING:
> > ----- vbot240520ps_11.mdcrd (1-4, 1) -----
> > 0% 33% 67% 100% Complete.
> >
> > Read 4 frames and processed 4 frames.
> > TIME: Avg. throughput= 74.0069 frames / second.
> >
> > ACTION OUTPUT:
> >
> > RUN TIMING:
> > TIME: Init : 0.0002 s ( 0.41%)
> > TIME: Trajectory Process : 0.0540 s ( 99.09%)
> > TIME: Action Post : 0.0000 s ( 0.00%)
> > TIME: Analysis : 0.0000 s ( 0.00%)
> > TIME: Data File Write : 0.0000 s ( 0.00%)
> > TIME: Other : 0.0003 s ( 0.00%)
> > TIME: Run Total 0.0545 s
> > ---------- RUN END ---------------------------------------------------
> > TIME: Total execution time: 20.4604 seconds.
> > --------------------------------------------------------------------------------
> > To cite CPPTRAJ use:
> > Daniel R. Roe and Thomas E. Cheatham, III, "PTRAJ and CPPTRAJ: Software
> > for Processing and Analysis of Molecular Dynamics Trajectory Data". J.
> > Chem. Theory Comput., 2013, 9 (7), pp 3084-3095.
> >
> >
> > Thank you,
> > Chris.
> > _______________________________________________
> > AMBER mailing list
> > AMBER.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber
> >
Received on Sat May 19 2018 - 15:30:02 PDT