Re: [AMBER] Early termination of parallel MD

From: Jason Swails <jason.swails.gmail.com>
Date: Mon, 30 Jun 2014 06:30:44 -0700

On Mon, Jun 30, 2014 at 12:52 AM, Valentina Romano <
valentina.romano.unibas.ch> wrote:

> Dear Amber users
>
> I want to run a MD in parallel.
> The input file is:
>
> #!/bin/bash -l
> #$ -N PknGAde_md
> #$ -l membycore=1G
> #$ -l runtime=50:00:00
> #$ -pe ompi 32
> #$ -cwd
> ##$ -o $HOME/queue/stdout
> ##$ -e $HOME/queue/stderr
>
> module load ictce/6.2.5
>
> export AMBERHOME=/import/bc2/home/schwede/romanov/amber12-amd
> export PATH=$AMBERHOME/bin:$PATH
>
> #echo "Got $NSLOTS processors."
> mpirun -v -np $NSLOTS pmemd.MPI -O -i PknGAde_md.in -o PknGAde_md01.out -p
> ../PknGAde_params/PknGHAdeH_ion_wt.prmtop -c PknGAde_equil.rst -r
> PknGAde_md01.rst -x PknGAde_md01.mdcrd
> mpirun -v -np $NSLOTS pmemd.MPI -O -i PknGAde_md.in -o PknGAde_md02.out -p
> ../PknGAde_params/PknGHAdeH_ion_wt.prmtop -c PknGAde_md01.rst -r
> PknGAde_md02.rst -x PknGAde_md02.mdcrd
> mpirun -v -np $NSLOTS pmemd.MPI -O -i PknGAde_md.in -o PknGAde_md03.out -p
> ../PknGAde_params/PknGHAdeH_ion_wt.prmtop -c PknGAde_md02.rst -r
> PknGAde_md03.rst -x PknGAde_md03.mdcrd
> mpirun -v -np $NSLOTS pmemd.MPI -O -i PknGAde_md.in -o PknGAde_md04.out -p
> ../PknGAde_params/PknGHAdeH_ion_wt.prmtop -c PknGAde_md03.rst -r
> PknGAde_md04.rst -x PknGAde_md04.mdcrd
> mpirun -v -np $NSLOTS pmemd.MPI -O -i PknGAde_md.in -o PknGAde_md05.out -p
> ../PknGAde_params/PknGHAdeH_ion_wt.prmtop -c PknGAde_md04.rst -r
> PknGAde_md05.rst -x PknGAde_md05.mdcrd
> mpirun -v -np $NSLOTS pmemd.MPI -O -i PknGAde_md.in -o PknGAde_md06.out -p
> ../PknGAde_params/PknGHAdeH_ion_wt.prmtop -c PknGAde_md05.rst -r
> PknGAde_md06.rst -x PknGAde_md06.mdcrd
> mpirun -v -np $NSLOTS pmemd.MPI -O -i PknGAde_md.in -o PknGAde_md07.out -p
> ../PknGAde_params/PknGHAdeH_ion_wt.prmtop -c PknGAde_md06.rst -r
> PknGAde_md07.rst -x PknGAde_md07.mdcrd
> mpirun -v -np $NSLOTS pmemd.MPI -O -i PknGAde_md.in -o PknGAde_md08.out -p
> ../PknGAde_params/PknGHAdeH_ion_wt.prmtop -c PknGAde_md07.rst -r
> PknGAde_md08.rst -x PknGAde_md08.mdcrd
> mpirun -v -np $NSLOTS pmemd.MPI -O -i PknGAde_md.in -o PknGAde_md09.out -p
> ../PknGAde_params/PknGHAdeH_ion_wt.prmtop -c PknGAde_md08.rst -r
> PknGAde_md09.rst -x PknGAde_md09.mdcrd
> mpirun -v -np $NSLOTS pmemd.MPI -O -i PknGAde_md.in -o PknGAde_md10.out -p
> ../PknGAde_params/PknGHAdeH_ion_wt.prmtop -c PknGAde_md09.rst -r
> PknGAde_md10.rst -x PknGAde_md10.mdcrd
>
> Where PknGAde_md.in is:
>
> &cntrl
> imin=0,
> irest=1,
> ig=-1,
> ntx=7,
> ntb=2,
> ntp=1,
> taup=2.0,
> igb=0,
> ntr=0,
> tempi=300.0, temp0=300.0,
> ntt=3, gamma_ln=1.0,
> ntc=2,
> ntf=2,
> cut=12.0,
>

This is an awfully large cutoff and will make your simulations run quite
slowly without improving accuracy. I would suggest a smaller value of ~8 to 9
Angstroms. By default, Amber computes the full long-range electrostatics
(not truncated at all) using the Particle-Mesh Ewald method, and includes a
long-range correction for the vdW terms that is typically quite good for
homogeneous systems.
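
For instance, leaving everything else in PknGAde_md.in exactly as you
posted it, the only line that needs to change would be the cutoff:

   cut=8.0,

PME still handles the electrostatics beyond the cutoff, so this mostly just
reduces the cost of the direct-space sum with essentially no loss of
accuracy.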


> nstlim=500000, dt=0.002,
> ntpr=500, ntwx=500, ntwr=1000
>
> Since I want to run a 10 ns MD, each PknGAde_md.in run is 500,000 steps
> (dt=0.002), and the run is repeated 10 times.
>
> When I run the script for the MD in parallel, it works fine for the first
> step. Afterwards the second step does not start and I do not understand why.
> I did not get any error messages, so it looks to me like the input for the
> parallel job is not correct and the job stops after the first step (the
> first 500,000 steps).
>
> Any suggestion?
>

A couple of comments. Without an error message, it's impossible to tell
what went wrong. It could be a filesystem issue. It could be that the
system got corrupted after the first simulation. It could be your queuing
system. It could be a lot of things. There _must_ have been some type of
error message printed to stdout, to stderr, or to one of the Amber mdout
files. Make sure you look at the stdout and stderr streams that were dumped
by your queuing system as well (it looks like you commented out their
redirection in your submission script, so I'm not sure where they would
have gone).
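
If you want those streams captured somewhere predictable, re-enabling the
two directives you commented out should do it (the paths are just the ones
already in your script; adjust to taste):

#$ -o $HOME/queue/stdout
#$ -e $HOME/queue/stderr

or, if your Grid Engine installation supports it, "#$ -j y" will merge
stderr into stdout so there is only one file to check.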

Also, what is the point of breaking up your simulation into 10 chunks only
to run them all back-to-back in the same submission script? In my
experience, the motivation for breaking them up like that is so you can
submit 10 different jobs with different scripts. If you are going to run
them all in the same script anyway, why not just run 1 simulation that is
10 times longer?
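
For example (just a sketch, reusing the file names from your script and
assuming a copy of your input, say PknGAde_md_10ns.in, with nstlim raised
tenfold to 5000000):

mpirun -v -np $NSLOTS pmemd.MPI -O -i PknGAde_md_10ns.in \
    -o PknGAde_md.out -p ../PknGAde_params/PknGHAdeH_ion_wt.prmtop \
    -c PknGAde_equil.rst -r PknGAde_md.rst -x PknGAde_md.mdcrd

That also removes the nine restart hand-offs between chunks, which is one
less place for something to go wrong.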

Also, carefully inspect the output files from the first step and look for
any anomalies (**'s in the restart file, very high energies in the mdout
file, corruption of the trajectory). Visualize everything (use
mdout_analyzer.py to plot the various energy contributions and make sure
everything looks sane). If everything looks fine, try running the second
simulation by hand. Does it work? If it does, then try splitting each of
the steps into a different script (you can typically submit them all at the
same time and set up dependencies so that each job is held until the one
before it has finished). If it doesn't, take note of the error message
printed either to the screen or in the mdout file. That should help you
debug.
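
If your queuing system is (Sun/Son of) Grid Engine, as the #$ directives in
your script suggest, job names plus -hold_jid are enough to chain the
pieces. A sketch with hypothetical script names (each runNN.sh would hold
the same #$ header plus one of the mpirun lines from your script):

qsub -N PknGAde_md01 run01.sh
qsub -N PknGAde_md02 -hold_jid PknGAde_md01 run02.sh
qsub -N PknGAde_md03 -hold_jid PknGAde_md02 run03.sh
# ...and so on through run10.sh; each job waits in the hold state until
# the job it names has finished, then starts automatically.

You also get a separate stdout/stderr and mdout per chunk, which makes it
much easier to see exactly where things die.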

HTH,
Jason

-- 
Jason M. Swails
BioMaPS,
Rutgers University
Postdoctoral Researcher
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Jun 30 2014 - 07:00:02 PDT