Re: [AMBER] H-REMD failure on GPU

From: Jason Swails <jason.swails.gmail.com>
Date: Thu, 26 Feb 2015 08:51:49 -0500

On Wed, 2015-02-25 at 17:57 +0100, Carlo Guardiani wrote:
> Dear Amber Developers,
>
> I am trying to run a simulation of H-REMD on a membrane protein.
> I am using Amber14 with the ff14SB and gaff force fields.
> My system comprises the channel protein, a small patch of membrane,
> and water and ions on both sides. My aim is to bias the backbone
> dihedrals of the N-terminal tail of the protein by rescaling the force
> constant by factors 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3. The
> topology files have been edited using parmed.py. For instance, to
> rescale the dihedral force constants by a factor of 0.5 I used the
> script:
>
> #Backbone phi angles
>
> deleteDihedral :1-36.C :2-37.N :2-37.CA :2-37.C
> addDihedral :1-36.C :2-37.N :2-37.CA :2-37.C 0.0000 4.0000 0.0000 1.2000 2.0000 type multiterm
> addDihedral :1-36.C :2-37.N :2-37.CA :2-37.C 0.0000 1.0000 0.0000 1.2000 2.0000 type multiterm
> addDihedral :1-36.C :2-37.N :2-37.CA :2-37.C 0.1350 2.0000 0.0000 1.2000 2.0000 type multiterm
> addDihedral :1-36.C :2-37.N :2-37.CA :2-37.C 0.2100 3.0000 0.0000 1.2000 2.0000
>
>
> #Backbone psi angles
>
> deleteDihedral :2-36.N :2-36.CA :2-36.C :3-37.N
> addDihedral :2-36.N :2-36.CA :2-36.C :3-37.N 0.2250 1.0000 180.0001 1.2000 2.0000 type multiterm
> addDihedral :2-36.N :2-36.CA :2-36.C :3-37.N 0.7900 2.0000 180.0001 1.2000 2.0000 type multiterm
> addDihedral :2-36.N :2-36.CA :2-36.C :3-37.N 0.2750 3.0000 180.0001 1.2000 2.0000 type multiterm
> addDihedral :2-36.N :2-36.CA :2-36.C :3-37.N 0.0000 4.0000 0.0000 1.2000 2.0000

This is an ambitious use of parmed! I would recommend checking that it
did what you wanted correctly. Here are some things to check (a rough
sketch of these checks follows the list):

1. Make sure that the total number of dihedrals before and after the
edit is the same (this does not apply to the number of dihedral *types*,
just the number of dihedrals with and without hydrogen). You can use
the 'printPointers' command in ParmEd and look for the variables NPHIA
and NPHIH (# of dihedrals without and with hydrogen, respectively). As
long as you replaced 4-term torsions with other 4-term torsions, that
should be fine.

2. Make sure that all of the dihedral definitions are 'correct', i.e.
that the dihedrals added by ParmEd are between the same atoms as the
ones deleted. printDihedrals can do this. [1]

3. Make sure that if you run a single-point energy calculation with each
prmtop, the 1-4 electrostatics and 1-4 van der Waals energies come out
identical. These pairlists are built from the list of torsions, so if
they match it is a very good indication that you added the torsions
correctly.
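
For concreteness, here is a rough sketch of what those three checks
could look like on the command line. The file names are the ones from
your commands above, the atom mask is only an example, and you should
double-check the parmed.py flags (-p/-i) against the Amber 14 manual
before trusting this verbatim:

# Checks 1 and 2: pointers and dihedral definitions, original vs. edited
cat > check.parmed << EOF
printPointers
printDihedrals :2-37.N
EOF
parmed.py -p vdac2_trim_1.0.prmtop -i check.parmed > check_1.0.txt
parmed.py -p vdac2_trim_0.5.prmtop -i check.parmed > check_0.5.txt
# compare NPHIA/NPHIH and diff the two dihedral listings

# Check 3: a short minimization with each prmtop on the SAME coordinates;
# compare the 1-4 EEL and 1-4 NB terms of the first printed step
cat > sp.in << EOF
single-point check
&cntrl
  imin = 1, maxcyc = 1, ncyc = 1,
  ntb = 1, cut = 8.0, ntpr = 1,
/
EOF
sander -O -i sp.in -p vdac2_trim_1.0.prmtop -c hvdac2-ions1M_trim.inpcrd -o sp_1.0.out
sander -O -i sp.in -p vdac2_trim_0.5.prmtop -c hvdac2-ions1M_trim.inpcrd -o sp_0.5.out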

This is good general advice, but I suspect this is not the problem (see
below).

> parmout vdac2_trim_0.5.prmtop
>
>
> Each of the eight replicas was minimized using its own topology file:
>
> &cntrl
> imin = 1,
> maxcyc = 2000,
> ncyc = 1000,
> ntb = 1,
> ntr = 0,
> cut = 8.0,
> ntwr = 2000,
> ntpr = 100,
> /
>
> sander -O -i min.in -o min.out -p vdac2_trim_1.0.prmtop -c hvdac2-ions1M_trim.inpcrd -r vdac2_trim_min_1.0.rst
>
> After the minimization each replica was equilibrated in the NPT ensemble
> using its own topology file:
>
> Equilibration
> &cntrl
> irest=0, ntx=1,
> nstlim=2500000, dt=0.002,
> ntt=3, gamma_ln=1.0,
> temp0=300.0,
> ntc=2, ntf=2, nscm=1000,
> ntb=2, ntp=2, cut=8.0,
> ntpr=100, ntwx=100, ntwr=1000,
> /

THIS is your problem. Each replica is being simulated in its own NPT
simulation -- with anisotropic scaling at that. At the end of this
step, all 8 simulations have different box sizes, volumes, and even
shapes (since you're using anisotropic scaling). No exchange will ever
work when you are trying to swap particle positions between replicas
with different box shapes -- you'll get overlapping periodic images that
will lead to infinities when calculating exchange acceptance rates (and
it will actually bring down the GPU entirely due to the incredibly large
forces overflowing the fixed-precision accumulators, as you've seen).

You split up your replicas too early. My workflow when running REMD
(and this is true for all REMD runs I do, including T-REMD, H-REMD,
pH-REMD, and MultiD-REMD) is to start with ONE system: minimize, heat,
and equilibrate at constant pressure until the density is stable. For
T- and H-REMD, start with the low-T or unaltered Hamiltonian, since
those are the replicas you're ultimately interested in. For pH-REMD,
the choice is arbitrary (and it makes no difference).

AFTER equilibration and density stabilization, *then* do another round
of equilibration at constant volume (NVT), one run per replica (*this*
is when you split up your replicas). That way, all replicas will
*always* have the same box dimensions, and REMD should work just fine.
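
In practice, an input along these lines, run once per replica (each with
its own prmtop) but all starting from the SAME density-equilibrated
restart, is what I mean -- just a sketch, and the run length and output
frequencies below are arbitrary choices:

NVT equilibration (constant volume, one run per replica)
&cntrl
  irest=1, ntx=5,
  nstlim=500000, dt=0.002,
  ntt=3, gamma_ln=1.0, temp0=300.0,
  ntc=2, ntf=2, nscm=1000,
  ntb=1, ntp=0, cut=8.0,
  ntpr=1000, ntwx=1000, ntwr=1000,
/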

The reason your histograms were misleading is explained below
(basically, it wasn't the dihedral energies that were messing things up,
it was the nonbonded energies).

This really should be turned into a runtime check in sander and pmemd to
make sure you don't attempt REMD simulations with replicas at different
box sizes/shapes.

> Up to this point everything was fine and I did not experience any kind of
> problem. Finally I tried to run the H-REMD simulation using the restart
> structures produced by the equilibration as input configurations. My
> input files (indeed all identical) are as follows:
>
> H-REMD
> &cntrl
> irest=0, ntx=1,
> nstlim=1250, dt=0.002,
> ntt=3, gamma_ln=1.0,
> tempi=300.0, temp0=300.0,
> ig=-1, iwrap=1,
> ntc=2, ntf=2, nscm=1000,
> ntp=0, ntb=1, cut=8.0,
> ntxo=2, ioutfm=1,
> ntpr=1250, ntwx=1250, ntwr=1250,
> numexchg=100,
> /
>
> I used the following groupfile:
>
> -O -rem 3 -remlog rem_R1.log -i hremd.mdin0.001 -o hremd.mdout_R1.001 -c
> vdac2_trim_equil.rst_R2.001 -r hremd.rst_R1.001 -x hremd.netcdf_R1.001
> -inf hremd.mdinfo_R1.001 -p vdac2_trim_1.0.prmtop
> -O -rem 3 -remlog rem_R1.log -i hremd.mdin0.002 -o hremd.mdout_R1.002 -c
> vdac2_trim_equil.rst_R2.002 -r hremd.rst_R1.002 -x hremd.netcdf_R1.002
> -inf hremd.mdinfo_R1.002 -p vdac2_trim_0.9.prmtop
> -O -rem 3 -remlog rem_R1.log -i hremd.mdin0.003 -o hremd.mdout_R1.003 -c
> vdac2_trim_equil.rst_R2.003 -r hremd.rst_R1.003 -x hremd.netcdf_R1.003
> -inf hremd.mdinfo_R1.003 -p vdac2_trim_0.8.prmtop
> -O -rem 3 -remlog rem_R1.log -i hremd.mdin0.004 -o hremd.mdout_R1.004 -c
> vdac2_trim_equil.rst_R2.004 -r hremd.rst_R1.004 -x hremd.netcdf_R1.004
> -inf hremd.mdinfo_R1.004 -p vdac2_trim_0.7.prmtop
> -O -rem 3 -remlog rem_R1.log -i hremd.mdin0.005 -o hremd.mdout_R1.005 -c
> vdac2_trim_equil.rst_R2.005 -r hremd.rst_R1.005 -x hremd.netcdf_R1.005
> -inf hremd.mdinfo_R1.005 -p vdac2_trim_0.6.prmtop
> -O -rem 3 -remlog rem_R1.log -i hremd.mdin0.006 -o hremd.mdout_R1.006 -c
> vdac2_trim_equil.rst_R2.006 -r hremd.rst_R1.006 -x hremd.netcdf_R1.006
> -inf hremd.mdinfo_R1.006 -p vdac2_trim_0.5.prmtop
> -O -rem 3 -remlog rem_R1.log -i hremd.mdin0.007 -o hremd.mdout_R1.007 -c
> vdac2_trim_equil.rst_R2.007 -r hremd.rst_R1.007 -x hremd.netcdf_R1.007
> -inf hremd.mdinfo_R1.007 -p vdac2_trim_0.4.prmtop
> -O -rem 3 -remlog rem_R1.log -i hremd.mdin0.008 -o hremd.mdout_R1.008 -c
> vdac2_trim_equil.rst_R2.008 -r hremd.rst_R1.008 -x hremd.netcdf_R1.008
> -inf hremd.mdinfo_R1.008 -p vdac2_trim_0.3.prmtop
>
>
> Finally, I tried to run the job using the GPUs:
>
> #!/bin/bash
>
> source /etc/modules.sh
> module load intel openmpi amber14
>
> export CUDA_VISIBLE_DEVICES=0,1,2,3
>
> mpirun -np 8 pmemd.cuda.MPI -ng 8 -groupfile hremd.groupfile
>
> The job unfortunately aborted almost immediately yielding
> nan and anomalous energy values from the very first steps.
> I also received the error message:
>
> Error: unspecified launch failure launching kernel kClearForces
> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
>
> I verified that:
>
> 1) The input configurations are not corrupt. In fact, if I restart
> the equilibration run, this proceeds without problems.
> 2) The topology files also should be correct, otherwise I should
> have had problems during the equilibration.
> 3) The protocol and input files for H-REMD are the same I used in
> an H-REMD simulation of an other system (the N-tail peptides of
> the channel protein in an octahedral water box) that didn't
> cause any problem.
>
>
> When I tried to re-run the simulation on the CPUs,
>
> mpirun -np 8 sander.MPI -ng 8 -groupfile hremd.groupfile
>
> the simulation did not abort. However in the mdout file I received the
> following error message that was repeated at every printout of the
> energies:
>
> NetCDF error: NetCDF: Variable not found
> at write temp0

This is very weird. I'll try to reproduce it. Basically what I *think*
it means is that sander is trying to write the temperature to the
trajectory file because it has detected that REMD is active. However,
temperatures are not stored in H-REMD, since they're irrelevant (T *can*
change, and the detailed balance equation implemented allows for it to
change, but only structures swap, so T never exchanges between
replicas). So the temp0 NetCDF variable is not set up when the
trajectory is initialized. If this doesn't crash sander, it's almost
certainly not causing your problems.

> Moreover, the exchanges between replicas were never accepted, because
> apparently the energy difference between replicas was too large. Here
> is a sample of my rem.log file:
>
[snip]
> Indeed the energy distributions at the end of the equilibration are not
> wonderful, since some of the distributions probably overlap too much
> (see attached Figure). In such a case, however, the problem should be
> the opposite of the one I observed: the replicas should exchange too
> often. In any case I checked that the dihedral force constants in the
> topology files correspond to the values I set with parmed.py.

You're exactly right -- strongly overlapping distributions should
indicate high acceptance ratios, not low (meaning you could get away
with far fewer replicas and still maintain good acceptance rates).
However, your histograms are misleading. This is what you would do for
T-REMD, where the PE distribution at each temperature is enough to
almost perfectly predict what the acceptance criterion will be. This is
because you exchange T, not structures, and take advantage of the fact
that the underlying potential energy surface is identical for all
replicas. For H-REMD, this is not true. MC attempts are based on
swapping structures, computing energy *differences* between the
different structures at different Hamiltonians, then taking the
difference of those differences. So the distributions that will help you
predict acceptance rates are much more complicated than a simple
histogram of potential energies for each replica in its own subset of
structures. You need the *total* energy (not just dihedral energies;
energy terms in a force field are not orthogonal), and more precisely
you need differences in total energy between adjacent Hamiltonians.
Basically you need the distributions of [H1(x1)-H1(x2)] vs.
[H2(x2)-H2(x1)]; H1(x1) vs. H2(x2) simply won't cut it.
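
To make that concrete, the quantity that actually enters the Metropolis
test for a same-temperature Hamiltonian swap is (using the notation
above, with beta = 1/kT):

  Delta = beta * ( [H1(x2) - H1(x1)] + [H2(x1) - H2(x2)] )
  P(exchange) = min( 1, exp(-Delta) )

so it is the spread of those bracketed energy differences -- not the
spread of H1(x1) and H2(x2) separately -- that determines how often
exchanges get accepted.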

> Suspecting that this could be a precision problem, I asked our
> system administrator to re-compile the Amber14 code using DPFP. In
> this case, however, I got the error message "unknown flag: -ng", which
> means that the double-precision code is incapable of running parallel
> jobs with multisander.

This actually means that the administrator only built the serial
version. You need to compile pmemd.cuda_DPFP.MPI. However, sander is
pure double precision, so precision is almost certainly not your
problem.
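
A quick way to check what actually got installed (assuming the usual
$AMBERHOME layout):

ls $AMBERHOME/bin/pmemd.cuda*

If the parallel DPFP executable was really built, pmemd.cuda_DPFP.MPI
should show up in that listing.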

HTH,
Jason

-- 
Jason M. Swails
BioMaPS,
Rutgers University
Postdoctoral Researcher
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Feb 26 2015 - 06:00:02 PST