Re: [AMBER] hpc error

From: David A Case <>
Date: Tue, 18 Aug 2015 10:12:59 -0400

On Tue, Aug 18, 2015, Damiano Spadoni wrote:
> I am trying to run the following script, involving a production step of
> my simulation, on an hpc facility:
> #!/bin/bash --login
> #PBS -N S4F4_md
> #PBS -l walltime=24:00:00
> #PBS -l select=4
> #PBS -j oe
> #PBS -A e280-Croft
> module load amber
> cd /work/e280/e280/enxds6/SF4
> aprun -n 96 sander.MPI -O -i -o S4F4_md1.out -p SF4.prmtop -c
> S4F4_heat.rst -r S4F4_md1.rst -x S4F4_md1.mdcrd
> aprun -n 96 sander.MPI -O -i -o S4F4_md2.out -p SF4.prmtop -c
> S4F4_md1.rst -r S4F4_md2.rst -x S4F4_md2.mdcrd

Oh my goodness....

(1) When things are not working, start simple (a single sander run, not six
of them in sequence).
(2) sander will almost certainly perform very badly (if at all)
on 96 processors. Try running on 2 or 4 to start. Then you can test the
timing of larger numbers.
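As a concrete sketch, a stripped-down test job might look like the following. The account, paths, and filenames are copied from the script above; `S4F4_md1.in` is a hypothetical mdin file name, since the `-i` argument is missing its value in the original script, and the walltime/queue settings are just placeholders for your site.

```shell
#!/bin/bash --login
#PBS -N S4F4_test
#PBS -l walltime=1:00:00
#PBS -l select=1
#PBS -j oe
#PBS -A e280-Croft

module load amber
cd /work/e280/e280/enxds6/SF4

# One short run on a handful of cores first; S4F4_md1.in is a
# hypothetical input file name (the -i value is missing above).
aprun -n 4 sander.MPI -O -i S4F4_md1.in -o S4F4_md1.out \
    -p SF4.prmtop -c S4F4_heat.rst -r S4F4_md1.rst -x S4F4_md1.mdcrd
```

Once that single run completes cleanly, scale up -n and add the second run back.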

> partition error in shake on processor 2
> this processor has atoms 13055 through 19568
> atom 19568 is within this range
> atom 19569 is not within this range !

You should look at the output file from the part of the run that failed.
Is it really running on 96 processors? How many atoms are in your system?
(The message above suggests some 6500 atoms per processor, which would put
your system at several hundred thousand atoms....) If your system is really
that large, try running your script on a much smaller system, then gradually
build up. Consider whether or not you can use pmemd instead of sander.
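The back-of-the-envelope arithmetic behind that estimate, assuming sander splits atoms roughly evenly across MPI ranks (which is how the reported range reads):

```shell
#!/bin/sh
# Numbers taken from the shake error message:
# "this processor has atoms 13055 through 19568"
first=13055
last=19568
nproc=96

per_proc=$(( last - first + 1 ))   # atoms owned by this one processor
total=$(( per_proc * nproc ))      # rough total, if the split is even

echo "atoms on this processor: $per_proc"   # 6514
echo "estimated total atoms:   $total"      # 625344
```

So this is a mid-sized system, not a million-atom one, and 96 sander processes is far more than it can use efficiently.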

> It is the first time I am trying to run this simulation on a cluster. I
> previously ran it (just the first sander.MPI command) on my machine and
> it worked, but I want to repeat this simulation on a cluster.
> Any suggestions about something I'm probably missing?

Was your local run also with the same number of processors? The "partition
error" means that the system is trying to SHAKE (constrain) the bond between
atoms 19568 and 19569. Check these atoms: is atom 19569 a hydrogen that is
bonded to atom 19568? Are those two atoms in different residues?

Partition errors like this can arise, depending on how complex your system
is. Since the partitioning depends on the number of processors you are using,
the error could well show up on a cluster but not on a "local machine"
(which, I'm guessing, has fewer than 96 cores).

Using the "checkValidity" command in parmed may help you localize the problem.
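One way to run that from the command line, and to inspect the two atoms named in the error at the same time, is to feed parmed a here-document. This is a sketch: it assumes parmed accepts the prmtop as its argument, and uses the printDetails action with an Amber atom-number mask.

```shell
# Run parmed non-interactively on the prmtop from the job script.
# checkValidity scans the topology for common problems; printDetails
# with the mask @19568-19569 shows the two atoms from the shake error,
# so you can see their names, residues, and whether one is a hydrogen.
parmed SF4.prmtop <<'EOF'
checkValidity
printDetails @19568-19569
quit
EOF
```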

...good luck...dac

AMBER mailing list
Received on Tue Aug 18 2015 - 07:30:04 PDT