Hi Filip,
Okay, I did look into this a bit further, for both Amber 10 and 11, and it looks to me like the code is basically fine. Any problems experienced with incomplete restart files are most likely the result of system crashes combined with a system configured in such a way that file buffer flushing is infrequent, there is no file journaling, or some other scenario where you are more apt to be caught with a partial dump of the restart file to disk. Very strange though, at least in my experience.
What pmemd does is write the complete restart, including the box information, and then do a Fortran rewind, which should reposition the file pointer to the beginning of the file for the next restart write. (I am talking about formatted restarts here; the binary code is still in place but I don't think it is widely used, though I presume it would work.) A side effect of the rewind, though, is to also write an EOF after the last byte previously written, so the file system's view of the restart file should be very explicit and very correct. So partial restart files should be really hard to get... I did not check through the runmd routine to see if there are any changes that alter restart write frequency under CUDA (there shouldn't be), but it also should not matter, in that not writing the restarts at the proper frequency still would not cause this problem.
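As an illustration of the write-then-rewind pattern described above, here is a minimal Python sketch. It is an analogue only, not pmemd's Fortran code. The function name `rewrite_in_place` is hypothetical, and the explicit `flush` and `os.fsync` calls are an addition for illustration (nothing in this thread says pmemd does this); they show what forcing the data to disk would look like while the file stays open.

```python
import os


def rewrite_in_place(handle, text):
    """Overwrite a file through an already-open handle and force it to disk."""
    handle.seek(0)              # reposition to the start, like a Fortran REWIND
    handle.write(text)          # write the complete new restart contents
    handle.truncate()           # end-of-file goes right after the last byte written
    handle.flush()              # push Python's buffer to the operating system
    os.fsync(handle.fileno())   # ask the OS to commit its cache to the device
```

The handle would be opened once (for example with mode `"w+"`) and reused for every restart write. Without the last two calls, a crash can leave whatever portion of the file the operating system happened to have written out, which is the partial-dump scenario discussed above.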
I would recommend doing shorter runs, honestly, than 100 nsec, but I have not been using the CUDA code, so I don't know how long these runs are taking. In the past, even at really high processor count, I typically did runs of around 20 nsec tops, just in case something caused irretrievable loss of the run (which honestly pretty much did not occur in any case). I am wondering if using CUDA is causing a higher frequency of node crashes, or something. Comments on such things would be welcome.
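A rough sketch of how a long run could be split into shorter chained segments, each starting from the restart written at the end of the previous one. The function name, the file names (`equil.rst`, `md001.rst`, and so on) and the default executable are assumptions for illustration; only the standard command-line flags `-O -i -p -c -o -r -x` are taken as given. With dt=0.002, a 20 nsec segment corresponds to nstlim=10000000 in the mdin.

```python
def segment_commands(n_segments, exe="pmemd.cuda", prmtop="prmtop",
                     mdin="mdin", start="equil.rst"):
    """Build one command line per segment; each starts from the previous restart."""
    commands = []
    previous_restart = start
    for i in range(1, n_segments + 1):
        name = f"md{i:03d}"               # hypothetical naming scheme
        commands.append([
            exe, "-O",
            "-i", mdin,                   # same input file for every segment
            "-p", prmtop,
            "-c", previous_restart,       # coordinates/velocities to start from
            "-o", f"{name}.out",
            "-r", f"{name}.rst",          # restart written by this segment
            "-x", f"{name}.nc",           # trajectory for this segment
        ])
        previous_restart = f"{name}.rst"
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` in order, so a crash costs at most one segment, because the restart of every finished segment is closed when that run exits.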
So anyway, my best guess is this is pretty much purely a system reliability problem, not an amber problem. We don't close the restart file between writes, as I said before, in the interest of efficiency. I would also mention that if one is paranoid (perhaps for good reason), one can get a series of numbered restart files (that are closed at end of write) by specifying a negative value for ntwr (please see manual); this is typically done though to get a series of system snapshots, not for reliability, I believe.
Best Regards - Bob Duke
________________________________________
From: filip fratev [filipfratev.yahoo.com]
Sent: Sunday, December 25, 2011 1:48 PM
To: amber.ambermd.org
Subject: Re: [AMBER] Restart file for pmemd not showing all information
Hi Bob,
Amber11 gives me a correct restart file only at the final step (when the
simulation finishes), i.e. if I run a 1 ns simulation I obtain a correct file only after 1 ns. Thus I can do heating, density equilibration and so on for my system, if
that is what you mean. My problem is that if I run 100 ns and something goes wrong
after 50 ns, I am not able to restart and continue my simulation.
Moreover, from what I know from Ross, if you set the same "ig" value for
pmemd.CUDA the simulation should continue in exactly the same way. The
failure is permanent. For my test today I used the standard Amber CUDA
test files, but as an example I can also give:
&cntrl
imin=0,irest=1,ntx=5,
nstlim=50000000,dt=0.002,
ntc=2,ntf=2,ig=-1,iwrap=1,
cut=8.0, ntb=2, ntp=1,
taup=1.0,
ntpr=5000, ntwx=5000, ntwr=10000,
ntt=3, gamma_ln=2.0,
temp0=300.0,
ioutfm=1,
/
Unfortunately, I don't have Amber10, but I can probably find Amber9; is it OK for these
tests? It is interesting because I know colleagues who have the same problem and use the same OS (Suse, gcc). On the other hand, from our
discussions here I know of people not experiencing this problem under Suse,
for example Marek, if I am not wrong...
All the best,
Filip
________________________________
From: "Duke, Robert E Jr" <rduke.email.unc.edu>
To: filip fratev <filipfratev.yahoo.com>; AMBER Mailing List <amber.ambermd.org>
Sent: Sunday, December 25, 2011 10:52 PM
Subject: RE: [AMBER] Restart file for pmemd not showing all information
Hi Filip,
Do you have access to pmemd 10? Can you try that? That would tell us whether it is a problem specific to your system, or to Amber 11. I don't work on Amber 11 much myself, so I would probably suggest that Walker's group pick it up, if it isolates to 11. I don't understand your statement that you don't use restarts much - I don't see how you would get trajectories of any length without using them, but maybe you are using amber a bit differently than what I am used to. It also might not hurt if you post what your mdin looks like for these runs. What is the failure rate?
Thanks - Bob
________________________________________
From: filip fratev [filipfratev.yahoo.com]
Sent: Sunday, December 25, 2011 12:46 PM
To: AMBER Mailing List
Subject: Re: [AMBER] Restart file for pmemd not showing all information
Hi Bob,
>Does this happen to you with Amber 11, and while using CUDA/CUDA.MPI? If you run non-CUDA pmemd.mpi, can you get it to happen? Sounds to me like you are talking small cluster systems, in-lab, correct?
Yes, I use just several individual desktop machines and Amber11. I tried again right now and the problem is the same when using both pmemd.cuda.MPI and pmemd.MPI, as well as when I use the serial version.
It is very strange. I noticed this problem a year ago, but because I never used restart files I am only reporting it here now.
All the best,
Filip
________________________________
From: "Duke, Robert E Jr" <rduke.email.unc.edu>
To: filip fratev <filipfratev.yahoo.com>; AMBER Mailing List <amber.ambermd.org>
Sent: Sunday, December 25, 2011 9:03 PM
Subject: Re: [AMBER] Restart file for pmemd not showing all information
Thanks Filip,
So the question for everyone with pmemd restart file problems becomes this: Does this happen to you with Amber 11, and while using CUDA/CUDA.MPI? The other question would be, "if you run non-CUDA pmemd.mpi (amber11 or amber10), can you get it to happen?". We then can distinguish between something specific to a version/build type of pmemd vs. a possible OS problem. Sounds to me like you are talking small cluster systems, in-lab, correct? (ie., you are not running at one of the big supercomputer centers with some sort of super-optimized parallel file system).
Best Regards - Bob Duke
________________________________________
From: filip fratev [filipfratev.yahoo.com]
Sent: Sunday, December 25, 2011 3:30 AM
To: AMBER Mailing List
Subject: Re: [AMBER] Restart file for pmemd not showing all information
Hi all,
Merry Christmas and happy New Year!
I have the same problem: some atoms are missing and there is no information about the box. I never obtained a full restart file during the simulations. I use pmemd.CUDA and CUDA.MPI compiled with gcc 4.3, 4.5 and 4.6 on different systems under Suse 11.3, 11.4 and 12.1. The only proper restart files are those obtained after the end of the simulation.
What might be the problem, and how can it be solved?
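A restart with missing atoms or no box line can be detected by counting lines. The sketch below is a rough check, not an Amber tool, and it rests on an assumed layout of the formatted restart: a title line, a second line beginning with the atom count, coordinates at six values (two atoms) per line, an optional velocity block of the same size, and one final box line. The function name and its arguments are hypothetical.

```python
import math


def restart_looks_complete(text, has_velocities=True, has_box=True):
    """Return True if a formatted restart has the expected number of lines."""
    lines = text.splitlines()
    if len(lines) < 2:
        return False
    try:
        natom = int(lines[1].split()[0])   # second line starts with the atom count
    except (IndexError, ValueError):
        return False
    per_block = math.ceil(natom / 2)       # 6 values per line = 2 atoms per line
    blocks = 2 if has_velocities else 1    # coordinates, plus velocities if present
    expected = 2 + per_block * blocks + (1 if has_box else 0)
    return len(lines) == expected
```

For a constant-pressure continuation run such as the mdin quoted in this thread (irest=1, ntb=2), both velocities and a box line would be expected, so the defaults apply. The check only counts lines; it does not validate the numbers themselves or catch a final line that was cut short.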
All the best,
Filip
________________________________
From: Bill Ross <ross.cgl.ucsf.EDU>
To: amber.ambermd.org
Sent: Saturday, December 24, 2011 11:26 PM
Subject: Re: [AMBER] Restart file for pmemd not showing all information
> If memory serves, really the only way we could flush the buffers during
> a run was an actual close and reopen cycle
How about flush()?
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gfortran/FLUSH.html
Though I think close/open would be easier to trust.
Bill
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Dec 27 2011 - 18:30:03 PST