Re: [AMBER] Intermittent error during T-REMD on 8 GPU computer

From: Milo Westler <milo.nmrfam.wisc.edu>
Date: Fri, 2 May 2014 10:33:05 -0500

Jason,
  Thanks for the quick response. Actually, you can see my naivety in that I
mistakenly thought I was running SHAKE. Duh!
Thanks for the corrections, I will definitely change things. I have run the
temperatures without REMD and have not seen errors; however, the errors seem
to be pretty rare, and I have only run the non-interacting set of
temperatures for a couple of 10 ns runs.
I haven't used the CPU.


On Fri, May 2, 2014 at 10:16 AM, Jason Swails <jason.swails.gmail.com> wrote:

> On Fri, 2014-05-02 at 09:33 -0500, Milo Westler wrote:
> > I have had this occur 3 or 4 times during the T-REMD runs that I am
> > performing. The energies and temperatures suddenly jump to large values,
> > thus aborting the run. I am running on an Exxact 8X GTX 780 GPU system
> > running pmemd.cuda.MPI in Amber 12. The GPUs passed the
> GPU_Validation_Test
> > provided with the computer and the error shows up on different replicas
> > (see additional examples at the bottom). Is this my error or a bug?
>
> My guess is a bad set of input parameters, but it could be neither. See
> my comments below.
>
> >
> > MDin file (1 of 8 temperatures spread from 300-372):
> > REMD 346K
> > &cntrl
> > ig=-1,
> > imin=0, ntx=1,
> > nstlim=500, dt=0.002,
>
> Yikes! A 2 fs time step with no SHAKE at high temperatures! This is
> asking for integration errors. I wouldn't even use a 2 fs time step
> _with_ SHAKE at temperatures near 400 K. Try setting ntc=2, ntf=2, and
> dt=0.001.
>
> > irest=0, ntt=3, gamma_ln=1.0,
> > tempi=346, temp0=346,
> > ntpr=100, ntwx=1000, ntwr=100000,
> > ntb=0, igb=7,
>
> Unless you have a specific reason for choosing igb=7 (comparing GB
> models?), I suggest the igb=8 model instead.
>
> > cut=999.,rgbmax=999.,
> > numexchg=20001, ioutfm=1
> > &end
>
> If your integration error is not the (only) problem, here is my generic
> advice when debugging GPU issues:
>
> Try disabling REMD and just running 8 non-interacting groups with
> pmemd.cuda.MPI (at the same REMD temperature ladder) -- do you observe
> the same errors? Do they always occur at the same temperatures?
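> For example, with a groupfile containing one line per replica (all
> filenames below are placeholders for your own):
>
>   -O -i mdin.300 -o mdout.300 -p prmtop -c inpcrd -r restrt.300
>   -O -i mdin.310 -o mdout.310 -p prmtop -c inpcrd -r restrt.310
>   ... (one line per temperature in the ladder) ...
>
> and launch without the -rem flag so the groups never attempt exchanges:
>
>   mpirun -np 8 pmemd.cuda.MPI -ng 8 -groupfile groupfile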
>
> If the errors still persist, try running each replica separately using
> pmemd.cuda in serial. Do you still get the same errors at the same
> temperatures? If so, try to capture the event in a trajectory file by
> dumping a restart file out right before it occurs and then printing out
> snapshots and energies every step to watch what happens. This should
> help provide insight.
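> For the capture run, restart from the saved restart file just before the
> failure and print everything every step (debugging-only settings; the
> output files will get large quickly):
>
>   irest=1, ntx=5,   ! read coordinates and velocities from the restart
>   ntpr=1, ntwx=1,   ! energies and snapshots every step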
>
> Does the same thing happen on the CPU?
>
> HTH,
> Jason
>
> --
> Jason M. Swails
> BioMaPS,
> Rutgers University
> Postdoctoral Researcher
>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>



-- 
-- Milo
===================================================
National Magnetic Resonance Facility at Madison
      An NIH-Supported Resource Center
W. Milo Westler, Ph.D.
NMRFAM Director
Senior Scientist
       and
Adjunct Professor
Department of Biochemistry
University of Wisconsin-Madison
433 Babcock Drive
Rm B160D
Madison, WI USA 53706-1544
EMAIL: milo.nmrfam.wisc.edu
PHONE: (608)-263-9599
FAX: (608)-263-1722
=======================================================================
Received on Fri May 02 2014 - 09:00:03 PDT