Re: [AMBER] Diverging sander.MPI results according to core allocation

From: Jason Swails <jason.swails.gmail.com>
Date: Thu, 2 Aug 2012 23:41:41 -0400

Hi Don,

Sorry it took me so long to respond here; I had meant to return to this
when I had more time, and it slipped my mind.

There have been a number of issues with constant pH in parallel in the
past, but I thought that I had fixed them. In my tests, I haven't seen any
issues in parallel with the latest versions of the code (both Amber 11 and
Amber 12).

I do have something for you to look out for, though, and some suggestions for
debugging. The issues in the past stemmed from a synchronization
requirement that parallel constant pH has. The protonation state jump is
attempted on every single thread, and the charge arrays are updated based
on the decisions being made -- both for the proposed state and the update
of the current state when a protonation state change is accepted.

If these nodes desynchronize, then chaos ensues and the simulation is
worthless from that point onward. What this means is that all random
numbers for the constant pH decisions (i.e., which residue do we attempt to
change, which state do we attempt to change to, do we attempt to change
multiple residues, and do we accept the Monte Carlo protonation state
change) must be identical on each node. Lines 255 to 257 of
$AMBERHOME/src/sander/constantph.F90 (Amber 12) explicitly synchronize
this seed, so the random number stream should be identical on every node.
 (This explicit synchronization was not necessary in Amber 11, since the
per-thread desynchronization of ig when ig == -1 was only introduced in
Amber 12.)
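
Just to illustrate the idea, here is a rough standalone sketch of that kind
of seed broadcast (this is not the actual sander code; the real routine uses
sander's own communicator and random number generator, and the names below
are only placeholders):

program seed_sync
   ! Rough sketch only: broadcast the random seed from the master rank so
   ! that every rank draws the same stream of Monte Carlo random numbers.
   ! MPI_COMM_WORLD stands in for sander's communicator here.
   use mpi
   implicit none
   integer :: ig, ierr, myrank

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)

   if (myrank == 0) ig = 1000    ! e.g. the ig value from your mdin file
   call MPI_Bcast(ig, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)

   ! After the broadcast every rank can seed its generator identically,
   ! so the protonation-state choices come out the same everywhere.
   write(*, '(a,i4,a,i8)') 'rank ', myrank, ' seed = ', ig
   call MPI_Finalize(ierr)
end program seed_sync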

Therefore, if the issue really is desynchronization, then it would have to
come from differing trailing digits in the energy difference (the dvdl
variable in egb.F90), which is collected via an MPI_Allreduce call. From
what I know of the MPI standard, an Allreduce should produce identical
results on every node, but it's possible that this is not strictly adhered
to on your system.

To test this, you can add a write statement at the end of subroutine
cnstphendstep that dumps each thread's resstate array, to make sure that
each thread maintains the same set of states. The statement would look
something like this:

write(666+mytaskid, '(15i4)') (resstate(i), i = 1, trescnt)

Then make sure that all of the resulting "fort.6**" files are identical.
It's also possible that adding this write statement will itself make the
problem go away, so if the files are all identical, double-check that the
divergence is still happening.
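
(In case it is not obvious where the file names come from: writing to an
unconnected unit number such as 666+mytaskid simply creates a file called
fort.<unit> with most compilers, so each rank gets its own file that can be
diffed afterwards. A tiny standalone illustration, with made-up values for
trescnt and mytaskid:)

program fort_dump
   implicit none
   integer, parameter :: trescnt = 36    ! made-up residue count
   integer :: resstate(trescnt), i, mytaskid

   mytaskid = 0          ! stand-in for the MPI rank
   resstate = 0          ! stand-in for the protonation states

   ! Writing to an unconnected unit creates fort.666, fort.667, etc.
   write(666+mytaskid, '(15i4)') (resstate(i), i = 1, trescnt)
end program fort_dump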

The version number of your mpich2 installation (is it really mpich2-1.5a2?)
suggests that it is an alpha release, and I am tempted to start by putting
the blame there. Either the bcast is failing before the initial seeding, or
the allreduce is giving numerical differences across nodes, which propagate
into different answers to the Monte Carlo question (and the subsequent
ruining of the simulation).
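
If you want to test the allreduce behavior directly, a throwaway program
along these lines (my own sketch, nothing to do with sander) should print
bitwise-identical sums on every rank:

program allreduce_check
   ! Each rank contributes a rank-dependent value; after the Allreduce
   ! every rank should hold exactly the same sum. Print it with full
   ! precision and compare the lines across ranks.
   use mpi
   implicit none
   integer :: ierr, myrank
   real(8) :: partial, total

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)

   partial = 1.0d0 / dble(myrank + 1)
   call MPI_Allreduce(partial, total, 1, MPI_DOUBLE_PRECISION, &
                      MPI_SUM, MPI_COMM_WORLD, ierr)

   write(*, '(a,i4,a,es25.17)') 'rank ', myrank, ' sum = ', total
   call MPI_Finalize(ierr)
end program allreduce_check

Run it with the same core layouts you used for sander (all 8 cores on one
node versus spread over several nodes) and compare the output; if the last
digits differ across ranks, that would point the finger squarely at the MPI
installation.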

Hopefully this helps some...

Good luck,
Jason

On Thu, Jul 26, 2012 at 6:25 PM, Bashford, Donald
<Don.Bashford.stjude.org>wrote:

> Dear Amber List members,
>
> We are finding that when we do a constant-pH simulation with
> sander.MPI, runs using 8 or more cores that start with the same
> coordinates, protonation state, pH=7, etc. and the same random seed give
> very different results according to how cores are distributed across
> nodes within the cluster. As the runs get to the nanosecond range, the
> RMSD of some runs grows enough to indicate unfolding, while other runs
> remain structurally stable. We initially saw the problem with Amber11,
> but preliminary tests with Amber12 (which did fine on the Amber test suite)
> show the same problems. The "best" results are seen when all of the
> allocated cores are on the same (8 core) cluster node. "Worse" results
> occur when the cores are spread over 2 or more nodes.
>
> We think that the simulations run at pH=7 should be quite stable. The
> system is a tetramer with 34 residues per monomer, including ACE and
> NME blocking groups, and 9 residues considered as titrating in each
> monomer. The titrating groups are ASPs, GLUs, TYRs and LYSs. There are
> ARGs but they are regarded as fixed positive residues. There are no
> HIS or CYS, or any other residues typically expected to titrate near
> the neutral range. The ionizable groups are exposed and/or
> involved in close contacts with opposite charges, so very little change
> of protonation state is expected at pH=7. In non-const-pH simulations
> in the past, this system has been quite stable in both explicit and
> implicit solvent.
>
> Both Amber11 and Amber12 were built with MPICH2 (release 1.8a2, I
> think) and Intel Parallel Studio compilers. Both used the amber
> configure flags, "configure intel" and "configure -mpi intel", for
> serial and parallel components, respectively.
>
> The hardware is an IBM blade cluster in which each node has 8 cores,
> is an x86_64 architecture and is running CentOS 6.1 (Linux 2.6.32).
>
> The command for running is:
>
> mpiexec_ssh -np $NUMPROC sander.MPI -O -i ../cph.in -o cph-r.out \
> -p ../tetmin.prmtop -c ../heat.rst -r cph.rst -x cph.mdcrd \
> -cpin ../cpin -cpout c.out -cprestrt rest
>
> where NUMPROC ranges from 8 to 48.
>
> The job is submitted through LSF, and we sometimes control the extent
> of spreading of allocated cores across nodes with a bsub option
> such as "-R span[ptile=8]", which, with NUMPROC=8, will wait for a node
> with all 8 cores free.
>
> The mdin file (cph.in) is:
>
> Constant PH
> &cntrl
> icnstph =1,
> irest=1, ntx=5,
> solvph = 7.0,
> cut = 30.0,
> igb =2,
> saltcon=0.1,
> ntb=0,
> dt=0.002,
> nrespa=1,
> ntt=3, tempi=300.0, temp0=300.0, tautp=2.0, gamma_ln = 1.0, ig = 1000,
> ntc = 2, ntf = 2, tol=0.000001,
> ntcnstph=18,
> nscm = 500,
> nstlim=5000000,
> ntpr = 500, ntwx = 500, ntwr = 10000
> /
>
> At this point I can't tell whether this is a problem in Amber, in
> MPICH2 or in our usage. Can anyone with MPI experience help us out?
>
> Thanks,
> Don Bashford
> Dept. Struc. Biol.
> St Jude Children's Res. Hosp.
>
>
> Email Disclaimer: www.stjude.org/emaildisclaimer
> Consultation Disclaimer: www.stjude.org/consultationdisclaimer
>
>
>



-- 
Jason M. Swails
Quantum Theory Project,
University of Florida
Ph.D. Candidate
352-392-4032
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Aug 02 2012 - 21:00:04 PDT