RE: [AMBER] Number of Cycles

From: Ross Walker <ross.rosswalker.co.uk>
Date: Tue, 15 Dec 2009 11:01:40 -0800

To add my advice here. I normally run something along the lines of:

ntwx=1000, ntwr=10000,

I tend to keep ntwr as large as possible since it does benefit performance,
especially on slow disk systems, as does keeping ntwx large if you can. I
typically tweak ntwr so it represents about 1 hour of run time. This way if
a job dies you lose at most 1 hour of simulation. You can vary this based on
your tolerance for wasting SUs etc.
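For reference, this is roughly how those settings sit in the &cntrl namelist
of the mdin file. The values here are illustrative placeholders only - in
particular, tune ntwr to whatever corresponds to about an hour of steps at
your machine's throughput:

```
 &cntrl
   ! ... your other &cntrl settings ...
   ntpr=500,      ! energy info to mdout every 500 steps
   ntwx=1000,     ! trajectory frame every 1000 steps
   ntwr=500000,   ! restart file roughly hourly (tune to your throughput)
 /
```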

If a job then crashes and you want to restart it, you simply look at the top
of the restart file and see what time it represents. Then you can work out
how many frames you should have in your trajectory file up to that time, run
it through ptraj and ditch the extra ones on the end - this will be between
0 and 9 extra frames, given the difference between ntwx and ntwr.

You can then also edit your output file and delete the extra output at the
end of it (since I typically have ntpr < ntwx < ntwr) so it matches the
time of the restart file you are using.
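The frame count for that bookkeeping is just restart_time / (dt * ntwx). A
quick shell sketch - the numbers here are made up for illustration; the time
(in ps) is the value you read from the second line of an ASCII restart file:

```shell
# Hypothetical example values - substitute what your restart file reports.
time_ps=1234.0   # simulation time from line 2 of the restart file (ps)
dt=0.002         # timestep (ps)
ntwx=1000        # steps between trajectory frames

# Frames to keep = time / (dt * ntwx); awk handles the floating-point math.
frames=$(awk -v t="$time_ps" -v dt="$dt" -v n="$ntwx" \
  'BEGIN { printf "%d\n", t / (dt * n) }')
echo "keep $frames frames"
```

With these example numbers that works out to 617 frames, so anything beyond
frame 617 in the mdcrd gets ditched in ptraj.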

An alternative approach, and one I use often since CPU hours are generally a
lot less valuable than human hours, is just to set ntwr = nstlim (or 0). Then
set your job up so it can complete nstlim steps in about 12 hours for a
reasonable simulation length, say 2.0ns (and significantly less than the
wallclock limit of say 14 hours). I would then run a series of jobs like
this:

Job Script 1
mpirun -np 256 $AMBERHOME/exe/pmemd -O -i mdin.2ns -o mdout.0-2ns -p prmtop
-c inpcrd -x mdcrd.0-2ns -r restrt.0-2ns

Job Script 2
mpirun -np 256 $AMBERHOME/exe/pmemd -O -i mdin.2ns -o mdout.2-4ns -p prmtop
-c restrt.0-2ns -x mdcrd.2-4ns -r restrt.2-4ns

Job Script 3
mpirun -np 256 $AMBERHOME/exe/pmemd -O -i mdin.2ns -o mdout.4-6ns -p prmtop
-c restrt.2-4ns -x mdcrd.4-6ns -r restrt.4-6ns

Etc...

Then you submit the first job to your queuing system. Job 2 you submit as
having a dependency on job 1 completing successfully, job 3 as having a
dependency on job 2, etc. Most queuing systems let you do this.
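As a sketch of the submission, assuming a PBS/Torque-style scheduler (the
exact syntax varies by system - SLURM, for example, uses
sbatch --dependency=afterok:<jobid>; the script names here are hypothetical):

```shell
# Each qsub prints the new job's id; feed it into the next job's dependency.
jid1=$(qsub run.0-2ns.pbs)
jid2=$(qsub -W depend=afterok:"$jid1" run.2-4ns.pbs)
jid3=$(qsub -W depend=afterok:"$jid2" run.4-6ns.pbs)
```

With afterok, a job only starts if its predecessor exits cleanly, which is
exactly the behavior described below: one failure holds the whole chain.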

Then if a job fails it dies and all subsequent dependencies are killed. I
get an email telling me this. I then log in, find the last job that crashed,
and delete its output file and mdcrd file (there will be no restart file
since it did not finish and I only write the restart at the end). Then I
just resubmit that job and all subsequent dependencies.

This way my output files are always neat and tidy, my trajectory files
always begin at a nice point in time and include the exact same number of
frames, etc. I don't have to worry about cleaning things up in ptraj / vi.
At worst I lose 12 * 256 = 3072 CPU hours, but this really isn't a big
deal, especially since any crashes will be due to hardware failure and thus
the fault of the machine owner / operator.

This approach saves a considerable amount of human effort.

So just my 2 cents + 5 cents of freshly printed bailout money.

All the best
Ross

> -----Original Message-----
> From: amber-bounces.ambermd.org [mailto:amber-bounces.ambermd.org] On
> Behalf Of Niel Henriksen
> Sent: Tuesday, December 15, 2009 10:11 AM
> To: AMBER Mailing List
> Subject: RE: [AMBER] Number of Cycles
>
> >If nstlim is evenly divisible by ntwx, then the last mdcrd trajectory
> frame
> >written should I believe correspond to the final restart file. The
> >important thing to remember, is that the restart file will always be
> written
> >at the final step of the run, regardless of the value of ntwr.
>
> Yes I agree. If all my jobs ended before the wallclock limit there would
> be no problem. However, I am greedy with every second I get, so all of my
> jobs get killed before they end "normally". Thus, to ensure that I don't
> have redundant data, I like to write restart files with every trajectory
> frame. (I also write 2 restart files each ntwr so that if one gets only
> partially written I have a back-up). I suppose I should evaluate whether
> this approach maximizes the use of resources and minimizes the total
> (real) time to complete a simulation.
>
> Off the top of your head, if I use somewhere between 32 - 64 processors
> on a teragrid machine (say ranger or kraken) for a simulation with 40,000
> atoms, would I get a big performance impact with ntpr=ntwx=ntwr=500?
>
> Thanks,
> --Niel
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber


Received on Tue Dec 15 2009 - 11:30:03 PST