Re: [AMBER] H-REMD, pmemd fails to calculate potential energy of neighbor coordinates

From: Jason Swails <jason.swails.gmail.com>
Date: Sun, 16 Dec 2012 10:20:39 -0500

On Sat, Dec 15, 2012 at 11:26 PM, Jiri Wiesner <wiesner.chemi.muni.cz> wrote:

> Dear Amber developers and users:
> I use the Hamiltonian replica exchange method to calculate the free energy
> of the perturbation between the first and the last replica (REFEP). My
> Amber source tree has been updated with the latest set of patches (29 for
> AmberTools, 13 for Amber). I compiled the sources with the Intel 12
> compiler and I use OpenMPI 1.6.3. I have 16 replicas and I have tested
> both sander and pmemd. I have tried running sander on 16, 32 and 64 CPUs
> (all in a single machine) and pmemd on 32 and 64 CPUs (sander can use 1 or
> more CPUs per replica, while pmemd requires 2 or more CPUs per replica).
> My system is solvated and has PBC, so PME is switched on.
>
>

I have to ask -- does the $AMBERHOME/test/h_rem test pass with both sander
and pmemd? You can check this explicitly by setting DO_PARALLEL="mpirun
-np <2,4,8>" and running "make test.sander.REM" and "make test.pmemd.REM".


>
> The problem is that whereas sander on 16 CPUs gives this result:
> [snip]
>


> Please note that the potential energy of the neighbor's coordinates in the
> pmemd run is substantially higher than in the sander run. Some more rem.log
> files are attached (sander on 16 CPUs, sander on 64 CPUs, pmemd on 32 CPUs,
> pmemd on 64 CPUs). The same issue occurs with the GNU 4.3.2 compiler, and a
> greater number of CPUs per replica essentially makes the calculation fail.
>
>

This is obviously not good...

> I was also trying to investigate the situation on my own and modified the
> code of both sander and pmemd to obtain a dump of some arrays: the forces
> in the case of sander (the ftemp array in remd.F90, line 2447, produced by
> call force(x,ix,ih,...)) and, in the case of pmemd, some of the arguments
> of the pme_force subroutine (the frc_temp array in remd_exch.F90, line 523,
> call pme_force(atm_cnt, crd_temp, frc_temp, ...)). I am attaching the
> forces calculated by replica 1 the first time the above-mentioned
> subroutines were executed - the (neighbor) coordinates are those of
> replica 16. I think the values in the files should be the same, because
> the forces are from the very start of the simulation, but they are not. It
> is notable that the file pmemd_run_64_forces.001 (pmemd run on 64 CPUs)
> contains circa 75% zero components and 25% non-zero components, which
> would correspond to the part done by the master process of that replica. I
> have no knowledge of the internals of the pme_force subroutine, so I am
> quite helpless at this point. I am not sure whether my findings about the
> forces are of any relevance to the failure to calculate the potential
> energy of the neighbor coordinates.
>

You nailed the reason why most forces are 0 on the master process.
However, unless you sum the forces across all of the MPI processes, I don't
think they're comparable. Especially with PME, the same processor won't
necessarily compute all components of the force for the same atoms -- some
cores are responsible for the reciprocal-space part of the PME force while
other cores do the direct-space sum. If you want an 'updated' force array,
you need to allocate a separate NATOM*3-length array (let's say f_buf) and
run the command:

call MPI_Reduce(ftemp, f_buf, natom*3, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                pmemd_comm, err_code_mpi)
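
In context, that would look something like the sketch below. The names
ftemp, f_buf, natom, pmemd_comm and err_code_mpi are the ones from your
dump code and the call above; the subroutine name, the output unit and the
format are only illustrative, so adapt them to wherever you do the dump:

subroutine dump_reduced_forces(natom, ftemp, pmemd_comm)
   use mpi
   implicit none
   integer, intent(in) :: natom, pmemd_comm
   double precision, intent(in) :: ftemp(3*natom)
   double precision, allocatable :: f_buf(:)
   integer :: err_code_mpi, my_rank, i

   allocate(f_buf(3*natom))

   ! Sum each rank's partial force contributions onto rank 0 of this
   ! replica's communicator.
   call MPI_Reduce(ftemp, f_buf, 3*natom, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                   pmemd_comm, err_code_mpi)

   call MPI_Comm_rank(pmemd_comm, my_rank, err_code_mpi)
   if (my_rank == 0) then
      ! Only rank 0 now holds the complete force array; dump it so it can
      ! be compared against the sander output (unit 99 -> fort.99).
      do i = 1, 3*natom, 3
         write(99, '(3f20.10)') f_buf(i), f_buf(i+1), f_buf(i+2)
      end do
   end if

   deallocate(f_buf)
end subroutine dump_reduced_forces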

Also, the only steps that can really be compared accurately are the first
steps, or any steps where the coordinates are identical. Can you set
ntwx=1, ntpr=1 and track where they start to diverge?
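
That is just these two flags in the &cntrl namelist of each replica's mdin
(ntpr=1 prints the energies every step, ntwx=1 writes the coordinates every
step). A minimal fragment -- merge it into your existing H-REMD input
rather than replacing it:

every-step output for tracking the sander/pmemd divergence
 &cntrl
   ntpr = 1,
   ntwx = 1,
 /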

I'll take a look at your files when I get a chance.

Thanks for the report!
Jason

-- 
Jason M. Swails
Quantum Theory Project,
University of Florida
Ph.D. Candidate
352-392-4032
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Sun Dec 16 2012 - 07:30:02 PST