[AMBER] Diverging sander.MPI results according to core allocation

From: Bashford, Donald <Don.Bashford.stjude.org>
Date: Thu, 26 Jul 2012 17:25:11 -0500

Dear Amber List members,

We are finding that when we do a constant-pH simulation with
sander.MPI, runs using 8 or more cores that start with the same
coordinates, protonation state, pH=7, etc. and the same random seed give
very different results according to how cores are distributed across
nodes within the cluster. As the runs get to the nanosecond range, the
RMSD of some runs grows enough to indicate unfolding while other runs
remain structurally stable. We initially saw the problem with Amber11,
but preliminary tests with Amber12 (which did fine on the Amber test suite)
show the same problems. The "best" results are seen when all of the
allocated cores are on the same (8-core) cluster node. "Worse" results
occur when the cores are spread over 2 or more nodes.
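
For reference, the drift is being judged from the RMSD of each
trajectory against the starting structure, roughly along these lines (a
minimal cpptraj sketch, not our exact script; the backbone-atom mask and
the assumption that the 4 x 34 residues are numbered consecutively as
:1-136 are illustrations only, and the file names follow the run command
further down):

# Sketch: per-run RMSD vs. the starting coordinates, using cpptraj
cpptraj -p ../tetmin.prmtop <<EOF
reference ../heat.rst
trajin cph.mdcrd
rms ToStart reference :1-136@CA,C,N out rmsd.dat
go
EOF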

We think that the simulations run at pH=7 should be quite stable. The
system is a tetramer with 34 residues per monomer, including ACE and
NME blocking groups, and 9 residues considered as titrating in each
monomer. The titrating groups are ASPs, GLUs, TYRs and LYSs. There are
ARGs but they are regarded as fixed positive residues. There are no
HIS or CYS, or any other residues typically expected to titrate near
the neutral range. The ionizable groups are exposed and/or
involved in close contacts with opposite charges, so very little change
of protonation state is expected at pH=7. In non-const-pH simulations
in the past, this system has been quite stable in both explicit and
implicit solvent.

Both Amber11 and Amber12 were built with MPICH2 (release 1.8a2, I
think) and the Intel Parallel Studio compilers. Both used the standard
Amber configure invocations, "configure intel" and "configure -mpi
intel", for the serial and parallel components, respectively.
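
For concreteness, the build sequence was essentially the standard one,
along these lines (a sketch, not our exact commands; the install path is
a placeholder and the MPICH2/Intel environment is assumed to be set up
already):

export AMBERHOME=/opt/amber12    # placeholder install location
cd $AMBERHOME
# mpicc/mpif90 from MPICH2 and the Intel compilers assumed to be on PATH
./configure intel                # serial components
make install
./configure -mpi intel           # parallel components (sander.MPI etc.)
make install
export DO_PARALLEL="mpiexec_ssh -np 8"   # example setting for the parallel tests
make test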

The hardware is an IBM blade cluster in which each node has 8
x86_64 cores and runs CentOS 6.1 (Linux 2.6.32).

The command for running is:

mpiexec_ssh -np $NUMPROC sander.MPI -O -i ../cph.in -o cph-r.out \
  -p ../tetmin.prmtop -c ../heat.rst -r cph.rst -x cph.mdcrd \
  -cpin ../cpin -cpout c.out -cprestrt rest

where NUMPROC ranges from 8 to 48.

The job is submitted through LSF, and we sometimes control the extent
to which the allocated cores are spread across nodes with a bsub
option such as "-R span[ptile=8]", which, with NUMPROC=8, will wait
for a node with all 8 cores free.
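
For example, a typical submission script looks roughly like this (the
job name, script name and log file are placeholders; the resource
request and the sander.MPI command line are as above), submitted with
"bsub < run_cph.sh":

#!/bin/bash
#BSUB -J cph_ph7                 # job name (placeholder)
#BSUB -n 8                       # NUMPROC cores requested
#BSUB -R "span[ptile=8]"         # keep all 8 cores on one node
#BSUB -o cph.%J.log              # LSF log file (placeholder)

NUMPROC=8
mpiexec_ssh -np $NUMPROC sander.MPI -O -i ../cph.in -o cph-r.out \
  -p ../tetmin.prmtop -c ../heat.rst -r cph.rst -x cph.mdcrd \
  -cpin ../cpin -cpout c.out -cprestrt rest

Dropping the span requirement (or lowering ptile) is what lets LSF
spread the ranks over two or more nodes.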

The mdin file (cph.in) is:

Constant PH
 &cntrl
  icnstph=1,
  irest=1, ntx=5,
  solvph=7.0,
  cut=30.0,
  igb=2,
  saltcon=0.1,
  ntb=0,
  dt=0.002,
  nrespa=1,
  ntt=3, tempi=300.0, temp0=300.0, tautp=2.0, gamma_ln=1.0, ig=1000,
  ntc=2, ntf=2, tol=0.000001,
  ntcnstph=18,
  nscm=500,
  nstlim=5000000,
  ntpr=500, ntwx=500, ntwr=10000
 /

At this point I can't tell whether this is a problem in Amber, in
MPICH2 or in our usage. Can anyone with MPI experience help us out?

Thanks,
Don Bashford
Dept. Struc. Biol.
St Jude Children's Res. Hosp.

