Re: [AMBER] Restart file for pmemd not showing all information

From: Aron Broom <broomsday.gmail.com>
Date: Tue, 14 Feb 2012 12:40:59 -0500

I have encountered similar problems with restart files. There seems to be a
~50% chance of them missing (as Mohd says) ~50 lines or so of coordinates.
I have only checked for this when a simulation terminates unexpectedly, so
I can't claim that this happens in general. My impression was that this
results from AMBER attempting to write the restart file while the job is
terminating, and as such it could easily be solved by always keeping two
restart files: at each restart write, the previous file is renamed to
restart.old or something similar, and the new one is then written.
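
As a rough illustration of the two-file idea, here is a minimal Python
sketch. It is not how pmemd writes restarts; the function names and the
".old" suffix are made up for the example, and the only assumption is that
whatever writes the restart can rename the previous file first.

```python
import os


def rotate_restart(restart_path, backup_suffix=".old"):
    """Rename an existing restart file to <name>.old.

    Returns the backup path, or None if there was nothing to rotate.
    (Hypothetical helper; names are illustrative only.)
    """
    if not os.path.exists(restart_path):
        return None
    backup_path = restart_path + backup_suffix
    # os.replace overwrites an older backup if one is already there.
    os.replace(restart_path, backup_path)
    return backup_path


def write_restart(restart_path, text):
    """Keep the previous restart as a backup, then write the new one."""
    rotate_restart(restart_path)
    with open(restart_path, "w") as handle:
        handle.write(text)
```

With this arrangement, a job killed in the middle of a write would at worst
leave a damaged restart next to an intact restart.old from the previous
write.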

As I say, I've never looked for bad restart files being written while the
simulation is running.
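
Looking for them during a run could be done with something like the Python
sketch below. It assumes the formatted (ASCII) restart layout: a title line,
a line starting with the atom count, coordinates at six values per line,
then (optionally) velocities in the same layout and one box line. The
function names and the has_velocities / has_box switches are invented for
this example, and it only counts lines, so treat it as a heuristic.

```python
def expected_body_lines(natom, has_velocities=True, has_box=True):
    """Number of lines expected after the two header lines."""
    # 3 values per atom, 6 values per line, rounded up.
    per_block = (3 * natom + 5) // 6
    blocks = 2 if has_velocities else 1
    return per_block * blocks + (1 if has_box else 0)


def restart_is_complete(path, has_velocities=True, has_box=True):
    """Return True if the line count matches the atom count in the header."""
    with open(path) as handle:
        lines = handle.read().splitlines()
    # Ignore trailing blank lines, but keep a possibly blank title line.
    while lines and not lines[-1].strip():
        lines.pop()
    if len(lines) < 2:
        return False
    try:
        natom = int(lines[1].split()[0])
    except (IndexError, ValueError):
        return False
    expected = expected_body_lines(natom, has_velocities, has_box)
    return len(lines) - 2 == expected
```

For a periodic MD run that writes velocities, both switches would stay at
their defaults; a file missing the box line or ~50 lines of coordinates
would then come back as incomplete.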

~Aron

On Tue, Feb 14, 2012 at 12:33 PM, Ismail, Mohd F. <farid.ou.edu> wrote:

> I have to agree with Filip here. If I run a simulation for 1,000,000
> steps and my restart file is written every 10,000 steps, the intermediate
> restart files will not contain the complete information. I have checked
> both while the simulation is running and after killing the job. The
> restart file is missing the PBC box info, as well as a few lines (~50) of
> coordinates.
>
> I'm running CentOS 6. This is not on a cluster, just a single workstation.
>
> *******************************
> Mohd Farid Ismail
> Graduate Student
> Dept. of Chemistry/Biochemistry
> University of Oklahoma
> Norman 73019
>
> ________________________________________
> From: filip fratev [filipfratev.yahoo.com]
> Sent: Tuesday, February 14, 2012 10:38 AM
> To: AMBER Mailing List
> Cc: rduke.email.unc.edu
> Subject: Re: [AMBER] Restart file for pmemd not showing all information
>
> Hi all,
> I was not able to solve my problem with restart files. We have 4 PCs with
> completely different hardware except for the GPUs (GTX 580s) and the OS
> (SUSE 11.3-11.4, installed on all PCs). I was wondering what could provoke
> this problem. Could someone using SUSE try to reproduce my problem?
>
> All the best,
> Filip
>
>
> ________________________________
> From: "Duke, Robert E Jr" <rduke.email.unc.edu>
> To: filip fratev <filipfratev.yahoo.com>; AMBER Mailing List <
> amber.ambermd.org>
> Sent: Monday, December 26, 2011 12:30 AM
> Subject: RE: [AMBER] Restart file for pmemd not showing all information
>
> Hi Filip,
> Hmmm, I am thinking that what you may have here is OS-specific. I am glad
> to hear that it is only an "interrupted run" problem, but then I think
> you are solely at the mercy of how the machine does file system buffering.
> Still, a correctly operating machine should not simply be dropping bytes
> on the floor if the job terminates unexpectedly, and even then, currently
> used journaling filesystems should rarely miss much, even if the OS
> crashes. So I presume you are dealing with unexpected job termination, not
> unexpected machine crashing? I am going to act like it is Christmas (i.e.,
> stop working for the day), but if you give me the info about job crash vs.
> machine crash, I'll think about it all a little more early next week. I
> need to review the code, but it is my guess that I never designed
> restart-writing to survive absolutely everything that could go wrong with
> the machine (i.e., I don't believe I close and reopen the file at every
> write, and that is what you need to do to be certain that all file system
> buffers are more-or-less immediately flushed to disk - doing this would be
> a definite performance hit I expect, especially in parallel, as it stalls
> the master process). So it is my guess this is in no way a bug, but if you
> are having machines crashing left and right, it could be a heck of an
> annoyance (but I would think that the crashing itself would be something
> that really ought to be addressed..., if that is what is happening).
> Best Regards - Bob
>
> ________________________________________
> From: filip fratev [filipfratev.yahoo.com]
> Sent: Sunday, December 25, 2011 1:48 PM
> To: amber.ambermd.org
> Subject: Re: [AMBER] Restart file for pmemd not showing all information
>
> Hi Bob,
> Amber11 gives me a correct restart file only at the final step (when the
> simulation finishes), i.e. if I run a 1 ns simulation I obtain a correct
> file only after 1 ns. Thus I can run heating, density equilibration and so
> on for my system, if that is what you mean. My problem is that if I run
> 100 ns and something goes wrong after 50 ns, I am not able to restart and
> continue my simulation. Moreover, from what I know from Ross, if you set
> the same "ig" value for pmemd.CUDA the simulation should continue in
> exactly the same way. The failure is permanent. For my test today I used
> the standard Amber CUDA test files, but as an example I can also give:
> &cntrl
> imin=0,irest=1,ntx=5,
> nstlim=50000000,dt=0.002,
> ntc=2,ntf=2,ig=-1,iwrap=1,
> cut=8.0, ntb=2, ntp=1,
> taup=1.0,
> ntpr=5000, ntwx=5000, ntwr=10000,
> ntt=3, gamma_ln=2.0,
> temp0=300.0,
> ioutfm=1,
> /
>
> Unfortunately, I don't have Amber10 but can probably find Amber9; is that
> OK for these tests? It is interesting because I know colleagues who have
> the same problem and use the same OS (SUSE, gcc). On the other hand, from
> our discussions here I know of people not experiencing this problem under
> SUSE, for example Marek, if I am not wrong...
>
> All the best,
> Filip
>
> ________________________________
> From: "Duke, Robert E Jr" <rduke.email.unc.edu>
> To: filip fratev <filipfratev.yahoo.com>; AMBER Mailing List <
> amber.ambermd.org>
> Sent: Sunday, December 25, 2011 10:52 PM
> Subject: RE: [AMBER] Restart file for pmemd not showing all information
>
> Hi Filip,
> Do you have access to pmemd 10? Can you try that? That would tell us
> whether it is a problem specific to your system, or Amber 11. I don't work
> on Amber 11 much myself, so would probably suggest that Walker's group pick
> it up, if it isolates to 11. I don't understand your statement that you
> don't use restarts much - I don't see how you would get trajectories of
> any length without using them, but maybe you are using Amber a bit
> differently than what I am used to. It also might not hurt if you post
> what your mdin looks like for these runs. What is the failure rate?
> Thanks - Bob
>
> ________________________________________
> From: filip fratev [filipfratev.yahoo.com]
> Sent: Sunday, December 25, 2011 12:46 PM
> To: AMBER Mailing List
> Subject: Re: [AMBER] Restart file for pmemd not showing all information
>
> Hi Bob,
> > Does this happen to you with Amber 11, and while using CUDA/CUDA.MPI? If
> > you run non-CUDA pmemd.mpi, can you get it to happen? Sounds to me like
> > you are talking small cluster systems, in-lab, correct?
>
> Yes, I use just several individual desktop machines and Amber11. I tried
> again right now and the problem is the same when using both pmemd.cuda.MPI
> and pmemd.MPI, as well as when I use the serial version.
> It is very strange. I noticed this problem one year ago, but because I
> never used restart files I am only reporting it here now.
>
> All the best,
> Filip
>
>
>
>
> ________________________________
> From: "Duke, Robert E Jr" <rduke.email.unc.edu>
> To: filip fratev <filipfratev.yahoo.com>; AMBER Mailing List <
> amber.ambermd.org>
> Sent: Sunday, December 25, 2011 9:03 PM
> Subject: Re: [AMBER] Restart file for pmemd not showing all information
>
> Thanks Filip,
> So the question for everyone with pmemd restart file problems becomes
> this: Does this happen to you with Amber 11, and while using
> CUDA/CUDA.MPI? The other question would be, "if you run non-CUDA pmemd.mpi
> (amber11 or amber10), can you get it to happen?". We then can distinguish
> between something specific to a version/build type of pmemd vs. a possible
> OS problem. Sounds to me like you are talking small cluster systems,
> in-lab, correct? (i.e., you are not running at one of the big supercomputer
> centers with some sort of super-optimized parallel file system).
> Best Regards - Bob Duke
>
> ________________________________________
> From: filip fratev [filipfratev.yahoo.com]
> Sent: Sunday, December 25, 2011 3:30 AM
> To: AMBER Mailing List
> Subject: Re: [AMBER] Restart file for pmemd not showing all information
>
> Hi all,
> Merry Christmas and happy New Year!
>
> I have the same problem: some atoms are missing and there is no information
> about the box. I have never obtained a full restart file during a
> simulation. I use pmemd.CUDA and CUDA.MPI compiled with gcc 4.3, 4.5 and
> 4.6 on different systems under SUSE 11.3, 11.4 and 12.1. The only proper
> restart files are those obtained after the end of the simulation.
>
> What might be the problem, and how can I solve it?
>
>
> All the best,
> Filip
>
>
> ________________________________
> From: Bill Ross <ross.cgl.ucsf.EDU>
> To: amber.ambermd.org
> Sent: Saturday, December 24, 2011 11:26 PM
> Subject: Re: [AMBER] Restart file for pmemd not showing all information
>
> > If memory serves, really the only way we could flush the buffers during
> > a run was an actual close and reopen cycle
>
> How about flush()?
>
> http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gfortran/FLUSH.html
>
> Though I think close/open would be easier to trust.
>
> Bill
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber

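On the flush versus close/reopen question quoted above, the general pattern
for getting a file safely onto disk can be sketched as follows. This is
Python purely for illustration, not the Fortran used in pmemd, and
write_durably is a made-up name. The sketch assumes a POSIX-like file
system where renaming over an existing file is atomic.

```python
import os


def write_durably(path, text):
    """Write text to path so that a crash leaves either the old or new file.

    Hypothetical helper: writes to a temporary name, forces the data out,
    then renames the temporary file over the target.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        handle.write(text)
        handle.flush()             # empty the program's own buffer
        os.fsync(handle.fileno())  # ask the OS to write its cache to disk
    os.replace(tmp_path, path)     # swap the finished file into place
```

The flush() call here plays roughly the role of a flush in the writing
program, while os.fsync() goes one step further and involves the operating
system's cache, which is where the performance cost mentioned above would
come from.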


-- 
Aron Broom M.Sc
PhD Student
Department of Chemistry
University of Waterloo
Received on Tue Feb 14 2012 - 10:00:02 PST