Thanks for reporting this.  We can take a look at fixing the behavior in
amber18, but if it requires the system to "crash" at some point in the
run, I'm not sure we can bulletproof against it.  The .info, .out, and
.rst files are written in a set chronological order.  You may just as
easily end up with a restart file from well before the final frame of the
trajectory, which is then used to start the subsequent segment and repeats
much more than a single frame.  In general, I think that if ls -l reveals
that the binary trajectories are of different file sizes, there is a
problem and the simulation should be re-run after erasing everything
through the first file that deviated.  I'm not sure how to write code that
would prevent errors like the one you are seeing, except to do something
like mdgx, which allowed the user to specify multiple segments of one
gigantic trajectory and would then scan any previously written segments
upon startup, fast-forwarding to the first incomplete segment.  That system
introduced a lot of complexity that I ultimately decided was not worth
repeating in future projects, and even when one does make use of it, it is
no panacea for hardware-related problems.
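
A minimal sketch of the ls -l file-size check in Python.  It assumes every
segment was set up to write the same number of frames to a binary (NetCDF)
trajectory, so that complete files all share one size.  The function names
are hypothetical and not part of Amber.

```python
import os
from collections import Counter


def first_deviating_segment(sizes):
    """Return the index of the first size that differs from the most
    common size in the list, or None if all sizes agree (or the list
    is empty)."""
    if not sizes:
        return None
    # Assumption: the most common size is the size of a complete segment.
    expected, _count = Counter(sizes).most_common(1)[0]
    for index, size in enumerate(sizes):
        if size != expected:
            return index
    return None


def check_segments(paths):
    """Apply the size check to trajectory files given in run order."""
    sizes = [os.path.getsize(path) for path in paths]
    return first_deviating_segment(sizes)
```

Called with the trajectory files in run order, check_segments returns the
position of the first file whose size deviates, which is where the erase
and re-run would start.  A final segment that is still being written will
also be flagged, so the result needs a human look before anything is
deleted.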
Dave
On Sat, May 19, 2018 at 3:07 AM, Chris Neale <candrewn.gmail.com> wrote:
> Hello,
>
> I am reporting very rare behavior of amber16 (pmemd.cuda.MPI) in which a
> single frame of the trajectory can be lost when there is a crash after the
> .rst file is written but before the .mdcrd file is completely written
> (though it's possible that I misunderstand what is happening). I've only
> ever seen this once in a couple of years of many runs.
>
> 1) The .out file from simulation segment A lists the last timestep as N
> multiples of the save frequency
> 2) The next simulation segment, B, from the previous .rst file (generated by
> A) starts at N+1 multiples of the save frequency
> 3) The .info file from simulation A only lists N-1 multiples of the save
> frequency
> 4) The .mdcrd file from run A only contains N-1 multiples of the save
> frequency
>
> Therefore, I lost a single frame.
>
> It's obviously not a big deal, but I thought it was worth reporting.
>
> ### Here is the entire frame info in the .out file from simulation A
> (something obviously happened to the node early in the run, as there are
> only 5 frames):
>
>  NSTEP =   500000   TIME(PS) =  252000.000  TEMP(K) =   310.55  PRESS =     0.0
>  Etot   =  -1443981.0250  EKtot   =    511461.3750  EPtot      =  -1955442.4000
>  BOND   =     31063.5846  ANGLE   =    104624.2155  DIHED      =     94789.5943
>  UB     =     38637.6267  IMP     =      1622.4678  CMAP       =      -480.2490
>  1-4 NB =     16734.8086  1-4 EEL =     22438.6683  VDWAALS    =    110161.7064
>  EELEC  =  -2375034.8232  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>  EKCMT  =         0.0000  VIRIAL  =         0.0000  VOLUME     =   7707174.2890
>                                                     SURFTEN    =         0.0000
>                                                     Density    =         1.0159
>  ------------------------------------------------------------------------------
>
>
>  NSTEP =   750000   TIME(PS) =  253000.000  TEMP(K) =   310.39  PRESS =     0.0
>  Etot   =  -1443585.7144  EKtot   =    511200.8125  EPtot      =  -1954786.5269
>  BOND   =     30997.3434  ANGLE   =    104278.1909  DIHED      =     94840.9091
>  UB     =     38883.1231  IMP     =      1552.7852  CMAP       =      -480.8654
>  1-4 NB =     16700.8642  1-4 EEL =     22547.0194  VDWAALS    =    109796.3894
>  EELEC  =  -2373902.2861  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>  EKCMT  =         0.0000  VIRIAL  =         0.0000  VOLUME     =   7717052.7828
>                                                     SURFTEN    =         0.0000
>                                                     Density    =         1.0146
>  ------------------------------------------------------------------------------
>
>
>  NSTEP =  1000000   TIME(PS) =  254000.000  TEMP(K) =   310.33  PRESS =     0.0
>  Etot   =  -1444643.3525  EKtot   =    511106.4375  EPtot      =  -1955749.7900
>  BOND   =     30687.3571  ANGLE   =    104125.9164  DIHED      =     94639.0638
>  UB     =     38576.6598  IMP     =      1583.3243  CMAP       =      -529.8053
>  1-4 NB =     16717.5723  1-4 EEL =     23123.6355  VDWAALS    =    110061.3064
>  EELEC  =  -2374734.8203  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>  EKCMT  =         0.0000  VIRIAL  =         0.0000  VOLUME     =   7714688.7777
>                                                     SURFTEN    =         0.0000
>                                                     Density    =         1.0149
>  ------------------------------------------------------------------------------
>
>
>  NSTEP =  1250000   TIME(PS) =  255000.000  TEMP(K) =   309.93  PRESS =     0.0
>  Etot   =  -1444755.4813  EKtot   =    510449.4688  EPtot      =  -1955204.9500
>  BOND   =     30773.5521  ANGLE   =    104686.9685  DIHED      =     95135.0864
>  UB     =     38725.3307  IMP     =      1579.5279  CMAP       =      -506.7794
>  1-4 NB =     16752.8571  1-4 EEL =     22309.6579  VDWAALS    =    108843.7175
>  EELEC  =  -2373504.8687  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>  EKCMT  =         0.0000  VIRIAL  =         0.0000  VOLUME     =   7709825.3119
>                                                     SURFTEN    =         0.0000
>                                                     Density    =         1.0156
>  ------------------------------------------------------------------------------
>
>
>  NSTEP =  1250000   TIME(PS) =  255000.000  TEMP(K) =   309.93  PRESS =     0.0
>  Etot   =  -1444755.4813  EKtot   =    510449.4688  EPtot      =  -1955204.9500
>  BOND   =     30773.5521  ANGLE   =    104686.9685  DIHED      =     95135.0864
>  UB     =     38725.3307  IMP     =      1579.5279  CMAP       =      -506.7794
>  1-4 NB =     16752.8571  1-4 EEL =     22309.6579  VDWAALS    =    108843.7175
>  EELEC  =  -2373504.8687  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>  EKCMT  =         0.0000  VIRIAL  =         0.0000  VOLUME     =   7709825.3119
>                                                     SURFTEN    =         0.0000
>                                                     Density    =         1.0156
>
> ### Here is the first output of a frame from the .out file from simulation
> B:
>
>  NSTEP =   250000   TIME(PS) =  256000.000  TEMP(K) =   310.53  PRESS =     0.0
>  Etot   =  -1445111.8674  EKtot   =    511427.6562  EPtot      =  -1956539.5237
>  BOND   =     30987.3060  ANGLE   =    104014.2392  DIHED      =     94880.5768
>  UB     =     38713.1815  IMP     =      1588.8147  CMAP       =      -528.8742
>  1-4 NB =     16677.7195  1-4 EEL =     22418.8460  VDWAALS    =    110184.1128
>  EELEC  =  -2375475.4458  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>  EKCMT  =         0.0000  VIRIAL  =         0.0000  VOLUME     =   7706278.4149
>                                                     SURFTEN    =         0.0000
>                                                     Density    =         1.0160
>
> ### Here is the .info file from simulation A:
>
>  NSTEP =  1000000   TIME(PS) =  254000.000  TEMP(K) =   310.33  PRESS =     0.0
>  Etot   =  -1444643.3525  EKtot   =    511106.4375  EPtot      =  -1955749.7900
>  BOND   =     30687.3571  ANGLE   =    104125.9164  DIHED      =     94639.0638
>  UB     =     38576.6598  IMP     =      1583.3243  CMAP       =      -529.8053
>  1-4 NB =     16717.5723  1-4 EEL =     23123.6355  VDWAALS    =    110061.3064
>  EELEC  =  -2374734.8203  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>  EKCMT  =         0.0000  VIRIAL  =         0.0000  VOLUME     =   7714688.7777
>                                                     SURFTEN    =         0.0000
>                                                     Density    =         1.0149
>
>
> ### Here is output showing that the .mdcrd has only 4 frames (it should
> have 5):
>
> bash-4.2$ cpptraj -i cpptraj.inp
>
> CPPTRAJ: Trajectory Analysis. V16.16
>     ___  ___  ___  ___
>      | \/ | \/ | \/ |
>     _|_/\_|_/\_|_/\_|_
>
> | Date/time: 05/19/18 01:03:08
> | Available memory: 79.699 GB
>
> INPUT: Reading input from 'cpptraj.inp'
>   [parm bot240520ps/this.prmtop]
>     Reading 'bot240520ps/this.prmtop' as Amber Topology
>     CHAMBER topology: 1:                                CHARMM force field:
> No FF information parsed...
>   [trajin bot240520ps/vbot240520ps_11.mdcrd]
>     Reading 'bot240520ps/vbot240520ps_11.mdcrd' as Amber NetCDF
>   [list trajin]
>
> INPUT TRAJECTORIES (1 total):
>  0: 'vbot240520ps_11.mdcrd' is a NetCDF AMBER trajectory, Parm this.prmtop (Orthogonal box) (reading 4 of 4)
>   Coordinate processing will occur on 4 frames.
>   [run]
> ---------- RUN BEGIN -------------------------------------------------
>
> PARAMETER FILES (1 total):
>  0: this.prmtop, 783396 atoms, 190056 res, box: Orthogonal, 189320 mol, 186216 solvent
>
> INPUT TRAJECTORIES (1 total):
>  0: 'vbot240520ps_11.mdcrd' is a NetCDF AMBER trajectory, Parm this.prmtop (Orthogonal box) (reading 4 of 4)
>   Coordinate processing will occur on 4 frames.
>
> BEGIN TRAJECTORY PROCESSING:
> ----- vbot240520ps_11.mdcrd (1-4, 1) -----
>  0% 33% 67% 100% Complete.
>
> Read 4 frames and processed 4 frames.
> TIME: Avg. throughput= 74.0069 frames / second.
>
> ACTION OUTPUT:
>
> RUN TIMING:
> TIME:        Init               : 0.0002 s (  0.41%)
> TIME:        Trajectory Process : 0.0540 s ( 99.09%)
> TIME:        Action Post        : 0.0000 s (  0.00%)
> TIME:        Analysis           : 0.0000 s (  0.00%)
> TIME:        Data File Write    : 0.0000 s (  0.00%)
> TIME:        Other              : 0.0003 s (  0.00%)
> TIME:    Run Total 0.0545 s
> ---------- RUN END ---------------------------------------------------
> TIME: Total execution time: 20.4604 seconds.
> ------------------------------------------------------------
> --------------------
> To cite CPPTRAJ use:
> Daniel R. Roe and Thomas E. Cheatham, III, "PTRAJ and CPPTRAJ: Software for
>   Processing and Analysis of Molecular Dynamics Trajectory Data". J. Chem.
>   Theory Comput., 2013, 9 (7), pp 3084-3095.
>
>
> Thank you,
> Chris.
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>
Received on Sat May 19 2018 - 05:00:03 PDT