Re: [AMBER] Diverging sander.MPI results according to core allocation from Bashford, Donald on 2012-08-03 (Amber Archive Aug 2012)

From: Bashford, Donald <Don.Bashford.stjude.org>
Date: Fri, 3 Aug 2012 14:12:10 -0500

Thanks for the detailed response, Jason. Between my question and your
answer, our systems people have rebuilt amber with openmpi 1.6.0, the
current stable release, and have suggested we try that. I guess they
thought the mpi library might be the culprit too. So I think our first
course will be to try things with the new mpi and see if the problem
persists. I think we'll also take up your suggestion about write
calls, etc.

One thing about our testing. We first noticed the problem on ~10 ns
simulations. But for more recent tests, we wanted to run much shorter
simulations, so we've been monitoring RMSD from a reference structure
vs time and considering it a failure if the plots from two sims of a
few ps look different. The implicit assumption is that the same
starting conditions and same random seed on the same hardware should
produce IDENTICAL results regardless of the number and distribution of
nodes. Is this strictly true? Is it likely to be true for some
relatively small number of time steps? On an Altix, where we use SGI's
mpi and there is no cores-across-nodes issue, this works.
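
One reason bit-for-bit identity may not hold across core layouts: floating-point addition is not associative, so a sum reduced over a different grouping of partial sums can differ in its last digits. The sketch below is only an illustration in Python, not anything taken from sander; the function names and the three-value example are made up for the purpose, and the "ranks" are simply sublists.

```python
def sum_left_to_right(values):
    """Add values in order, the way a simple serial loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def reduce_over_ranks(chunks):
    """Each 'rank' sums its own chunk; the partial sums are then combined."""
    partials = [sum_left_to_right(chunk) for chunk in chunks]
    return sum_left_to_right(partials)


# The same three numbers, split across one rank or two (hypothetical data).
one_rank = reduce_over_ranks([[0.1, 0.2, 0.3]])
two_ranks = reduce_over_ranks([[0.1], [0.2, 0.3]])

print(one_rank)               # 0.6000000000000001
print(two_ranks)              # 0.6
print(one_rank == two_ranks)  # False
```

MD trajectories are chaotic, so a last-digit difference like this generally grows with time. Identical plots over a few ps are therefore plausible but probably not guaranteed when the decomposition changes.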

-Don

At Thu, 2 Aug 2012 22:41:41 -0500,
Jason Swails wrote:
>
> Hi Don,
>
> Sorry it took me so long to respond here, I had meant to return to it when I had more time and
> it slipped my mind.
>
> There have been a number of issues with constant pH in parallel in the past, but I thought that
> I had fixed them. In my tests, I haven't seen any issues in parallel with the latest versions
> of the code (both Amber 11 and Amber 12).
>
> I do have something to look out for, though, and some suggestions for debugging. The issues in
> the past stemmed from a synchronization requirement that parallel constant pH has. The
> protonation state jump is attempted on every single thread, and the charge arrays are updated
> based on the decisions being made -- both for the proposed state and the update of the current
> state when a protonation state change is accepted.
>
> If these nodes desynchronize, then chaos ensues and the simulation is worthless from that point
> onward. What this means is that all random numbers for the constant pH decisions (i.e., which
> residue do we attempt to change, which state do we attempt to change to, do we attempt to
> change multiple residues, and do we accept the monte carlo protonation state change) must be
> identical on each node. Lines 255 to 257 of $AMBERHOME/src/sander/constantph.F90 (Amber 12)
> explicitly synchronize this seed, so the random number stream should be identical for every
> node. (This was not necessary in Amber 11 since the desynchronization of ig when ig==-1 was
> introduced in Amber 12).
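
The synchronization requirement described above can be pictured with a toy model. This is a sketch under stated assumptions, not sander's code: Python's `random` module stands in for the real generator, `propose_moves` is a hypothetical helper, 36 is the residue count implied by the original post (4 monomers with 9 titrating residues each), and 1000 is the `ig` value from the posted mdin file.

```python
import random


def propose_moves(seed, n_residues=36, n_attempts=20):
    """Return the residue index picked at each Monte Carlo attempt."""
    rng = random.Random(seed)  # private generator, one per simulated rank
    return [rng.randrange(n_residues) for _ in range(n_attempts)]


# Four "ranks" that all received the same broadcast seed.
synced = [propose_moves(seed=1000) for _rank in range(4)]
print(all(moves == synced[0] for moves in synced))  # True

# One rank that missed the broadcast and kept a different seed.
stray = propose_moves(seed=1001)
print(stray == synced[0])
```

With the same seed every rank picks the same residues in the same order; a rank holding a different seed proposes different moves and its charge arrays would no longer match the others.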
>
> Therefore, if the issue really is desynchronization, then it would have to be some trailing
> digits in the energy difference (the dvdl variable in egb.F90), which are collected via an
> MPI_Allreduce call. From what I know of the MPI standard, an Allreduce should produce
> identical results on every node, but it's possible that this may not be strictly adhered to on
> your system.
>
> To test this, you can add a write statement to the end of subroutine cnstphendstep in which you
> dump each thread's resstate array to make sure that each thread maintains the same set of
> states. Do this with a command that looks something like:
>
> write(666+mytaskid, '(15i4)') (resstate(i), i = 1, trescnt)
>
> then, make sure that all of the "fort.6**" files are identical. It's also possible that this
> write statement will fix the problems, so if they're all identical, make sure the problem is
> still happening.
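
Checking the per-rank dump files by eye gets tedious with many ranks, so a small script may help. The sketch below is hypothetical: it assumes the compiler names the files for unopened units `fort.N` (so unit 666+mytaskid becomes `fort.666`, `fort.667`, ...) and that the files sit in the current directory. Note that with 48 ranks the unit numbers run up to 713, so a shell pattern like `fort.6*` would miss some of them; the helper below selects by unit number instead.

```python
import re
from pathlib import Path


def rank_dump_files(directory=".", base_unit=666):
    """Return fort.N files with N >= base_unit, sorted by unit number."""
    found = []
    for path in Path(directory).glob("fort.*"):
        match = re.fullmatch(r"fort\.(\d+)", path.name)
        if match and int(match.group(1)) >= base_unit:
            found.append((int(match.group(1)), path))
    return [path for _unit, path in sorted(found)]


def mismatched_files(paths):
    """Return the paths whose bytes differ from the first file in `paths`."""
    if not paths:
        return []
    reference = Path(paths[0]).read_bytes()
    return [p for p in paths[1:] if Path(p).read_bytes() != reference]


if __name__ == "__main__":
    files = rank_dump_files()
    bad = mismatched_files(files)
    print(f"{len(files)} files checked, {len(bad)} differ from the first")
```

An empty list from `mismatched_files` means every rank recorded the same protonation states; any file it returns marks a rank that diverged from rank 0.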
>
> The version number of your mpich2 installation (is it really mpich2-1.5a2?) suggests that it is
> an alpha release, and I am tempted to start with putting the blame there. Either the bcast is
> failing before the initial seeding, or the allreduce is giving numerical differences across
> nodes which propagate to different choices for the Monte Carlo question (and subsequent
> ruining of the simulation).
>
> Hopefully this helps some...
>
> Good luck,
> Jason
>
> On Thu, Jul 26, 2012 at 6:25 PM, Bashford, Donald <Don.Bashford.stjude.org> wrote:
>
> Dear Amber List members,
>
> We are finding that when we do a constant-pH simulation with
> sander.MPI, runs using 8 or more cores that start with the same
> coordinates, protonation state, pH=7, etc. and the same random seed give
> very different results according to how cores are distributed across
> nodes within the cluster. As the runs get to the nanosecond range, the
> RMSD of some runs grows to indicate unfolding while other runs
> remain structurally stable. We initially saw the problem with Amber11,
> but preliminary tests with Amber12 (which did fine on the Amber test suite)
> show the same problems. The "best" results are seen when all of the
> allocated cores are on the same (8 core) cluster node. "Worse" results
> occur when the cores are spread over 2 or more nodes.
>
> We think that the simulations run at pH=7 should be quite stable. The
> system is a tetramer with 34 residues per monomer, including ACE and
> NME blocking groups, and 9 residues considered as titrating in each
> monomer. The titrating groups are ASPs, GLUs, TYRs and LYSs. There are
> ARGs but they are regarded as fixed positive residues. There are no
> HIS or CYS, or any other residues typically expected to titrate near
> the neutral range. The ionizable groups are either exposed and/or
> involved in close contacts with opposite charges so very little change
> of protonation state is expected at pH=7. In non-const-pH simulations
> in the past, this system has been quite stable in both explicit and
> implicit solvent.
>
> Both Amber11 and Amber12 were built with MPICH2 (release 1.5a2, I
> think) and Intel Parallel Studio compilers. Both used the amber
> configure flags, "configure intel" and "configure -mpi intel", for
> serial and parallel components, respectively.
>
> The hardware is an IBM blade cluster in which each node has 8 cores,
> is an x86_64 architecture and is running CentOS 6.1 (Linux 2.6.32).
>
> The command for running is:
>
> mpiexec_ssh -np $NUMPROC sander.MPI -O -i ../cph.in -o cph-r.out \
> -p ../tetmin.prmtop -c ../heat.rst -r cph.rst -x cph.mdcrd \
> -cpin ../cpin -cpout c.out -cprestrt rest
>
> where NUMPROC ranges from 8 to 48.
>
> The job is submitted through LSF, and we sometimes control the extent
> of spreading of allocated cores across nodes with the bsub option,
> such as "-R span[ptile=8]" which, with NUMPROC=8 will wait for a node
> with all 8 cores free.
>
> The mdin file (cph.in) is:
>
> Constant PH
> &cntrl
> icnstph =1,
> irest=1, ntx=5,
> solvph = 7.0,
> cut = 30.0,
> igb =2,
> saltcon=0.1,
> ntb=0,
> dt=0.002,
> nrespa=1,
> ntt=3, tempi=300.0, temp0=300.0, tautp=2.0, gamma_ln = 1.0, ig = 1000,
> ntc = 2, ntf = 2, tol=0.000001,
> ntcnstph=18,
> nscm = 500,
> nstlim=5000000,
> ntpr = 500, ntwx = 500, ntwr = 10000
> /
>
> At this point I can't tell whether this is a problem in Amber, in
> MPICH2 or in our usage. Can anyone with MPI experience help us out?
>
> Thanks,
> Don Bashford
> Dept. Struc. Biol.
> St Jude Children's Res. Hosp.
>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>
> --
> Jason M. Swails
> Quantum Theory Project,
> University of Florida
> Ph.D. Candidate
> 352-392-4032
>
>

Received on Fri Aug 03 2012 - 12:30:03 PDT