Re: [AMBER] Intermittent error during T-REMD on 8 GPU computer

From: Jason Swails <jason.swails.gmail.com>
Date: Fri, 02 May 2014 11:16:20 -0400

On Fri, 2014-05-02 at 09:33 -0500, Milo Westler wrote:
> I have had this occur 3 or 4 times during the T-REMD runs that I am
> performing. The energies and temperatures suddenly jump to large values
> thus aborting the run. I am running on an Exxact 8X GTX 780 GPU system
> running pmemd.cuda.MPI in Amber 12. The GPUs passed the GPU_Validation_Test
> provided with the computer and the error shows up on different replicas
> (see additional examples at the bottom). Is this my error or a bug?

My guess is a bad set of input parameters, but it could be neither. See
my comments below.

>
> MDin file (1 of 8 temperatures spread from 300-372):
> REMD 346K
> &cntrl
> ig=-1,
> imin=0, ntx=1,
> nstlim=500, dt=0.002,

Yikes! A 2 fs time step with no SHAKE at high temperatures! This is
asking for integration errors. I wouldn't even use a 2 fs time step
_with_ SHAKE at temperatures near 400 K. Try setting ntc=2, ntf=2, and
dt=0.001.
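
For illustration, here is a minimal Python sketch that writes out what the
346 K input could look like with those three changes applied. The helper
name render_cntrl is made up for this example; every other value is copied
from the mdin quoted in this message. Treat it as a starting point under
those assumptions, not a validated input.

```python
# Sketch: render the 346 K mdin with SHAKE on and a 1 fs time step.
# render_cntrl is a hypothetical helper, not part of Amber.

def render_cntrl(title, settings):
    """Return mdin text: a title line followed by a &cntrl namelist."""
    body = ",\n".join(f"  {key}={value}" for key, value in settings.items())
    return f"{title}\n &cntrl\n{body}\n &end\n"

settings = {
    "imin": 0, "irest": 0, "ntx": 1, "ig": -1,
    "nstlim": 500, "dt": 0.001,   # 1 fs step instead of 2 fs
    "ntc": 2, "ntf": 2,           # SHAKE on bonds involving hydrogen
    "ntt": 3, "gamma_ln": 1.0,
    "tempi": 346, "temp0": 346,
    "ntpr": 100, "ntwx": 1000, "ntwr": 100000,
    "ntb": 0, "igb": 7,
    "cut": 999.0, "rgbmax": 999.0,
    "numexchg": 20001, "ioutfm": 1,
}

print(render_cntrl("REMD 346K", settings))
```

Note that in REMD nstlim is the number of steps between exchange
attempts, so halving dt with nstlim=500 shortens that interval from 1 ps
to 0.5 ps. Raising nstlim to 1000 would keep it at 1 ps if that matters
for your protocol.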

> irest=0, ntt=3, gamma_ln=1.0,
> tempi=346, temp0=346,
> ntpr=100, ntwx=1000, ntwr=100000
> ntb=0, igb=7,

Unless you have a specific reason for choosing igb=7 (comparing GB
models?), I suggest the igb=8 model instead.

> cut=999.,rgbmax=999.,
> numexchg=20001, ioutfm=1
> &end

If your integration error is not the (only) problem, here is my generic
advice when debugging GPU issues:

Try disabling REMD and just running 8 non-interacting groups with
pmemd.cuda.MPI (using the same temperature ladder as the REMD run) -- do
you observe the same errors? Do they always occur at the same temperatures?
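
One way to set up that test is with a groupfile, one line per replica,
launched without the -rem flag so that no exchanges are attempted. The
Python sketch below only generates the groupfile lines. The file names
(mdin.001, inpcrd.001, and so on) and the helper name are assumptions for
this example; substitute whatever your REMD setup already uses.

```python
# Sketch: build a groupfile for 8 independent (non-exchanging) runs.
# File naming is hypothetical; adapt it to your existing replica files.

def groupfile_lines(n_groups, prmtop="prmtop"):
    """Return one pmemd command-line string per group."""
    lines = []
    for i in range(1, n_groups + 1):
        tag = f"{i:03d}"
        lines.append(
            f"-O -i mdin.{tag} -o mdout.{tag} -p {prmtop} "
            f"-c inpcrd.{tag} -r restrt.{tag} -x mdcrd.{tag}"
        )
    return lines

lines = groupfile_lines(8)
print("\n".join(lines))

# Save the output as "groupfile", then launch with something like:
#   mpirun -np 8 pmemd.cuda.MPI -ng 8 -groupfile groupfile
# (the MPI launcher name depends on your installation)
```

Without -rem, nstlim is the total number of MD steps rather than the
interval between exchanges, so each mdin would need a much larger nstlim,
and I would drop numexchg from it.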

If the errors still persist, try running each replica separately using
pmemd.cuda in serial. Do you still get the same errors at the same
temperatures? If so, try to capture the event in a trajectory file by
dumping a restart file out right before it occurs and then printing out
snapshots and energies every step to watch what happens. This should
help provide insight.
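
As a rough sketch of that last step, the Python snippet below takes a
serial mdin and rewrites it into a short "debug" input that restarts from
the saved restart file and prints energies and coordinates every step.
The base input, the helper name, and nstlim=2000 are assumptions made for
this example; pick a length that covers the window in which the blow-up
occurs.

```python
# Sketch: turn a production mdin into an every-step debug restart input.
# apply_overrides is a hypothetical helper; base_mdin is an assumed input.
import re

def apply_overrides(mdin_text, overrides):
    """Replace key=value pairs in a namelist, adding missing keys."""
    for key, value in overrides.items():
        pattern = rf"\b{re.escape(key)}\s*=\s*[^,\s]+"
        if re.search(pattern, mdin_text):
            mdin_text = re.sub(pattern, f"{key}={value}", mdin_text)
        else:
            # Key not present: add it just before the namelist terminator.
            mdin_text = mdin_text.replace("&end", f"{key}={value}\n&end", 1)
    return mdin_text

base_mdin = """Serial debug run 346K
&cntrl
ig=-1,
imin=0, ntx=1,
nstlim=500, dt=0.001,
ntc=2, ntf=2,
irest=0, ntt=3, gamma_ln=1.0,
tempi=346, temp0=346,
ntpr=100, ntwx=1000, ntwr=100000,
ntb=0, igb=7,
cut=999.,rgbmax=999.,
ioutfm=1
&end
"""

debug_overrides = {
    "irest": 1, "ntx": 5,   # restart, reading velocities from the restart file
    "nstlim": 2000,         # assumed window length
    "ntpr": 1, "ntwx": 1,   # energies and snapshots every step
}

debug_mdin = apply_overrides(base_mdin, debug_overrides)
print(debug_mdin)
```

Keep in mind that with ig=-1 the Langevin thermostat gets a new random
seed on every run, so the restarted trajectory will not retrace the
original one exactly. You may need a few attempts, or a smaller ntwr in
the original run so the restart file sits close to the event.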

Does the same thing happen on the CPU?

HTH,
Jason

-- 
Jason M. Swails
BioMaPS,
Rutgers University
Postdoctoral Researcher
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri May 02 2014 - 08:30:02 PDT