Hi Bill,
It is very strange that this happens on the CPU. I have NEVER seen such a
thing on Kraken, despite running many tens of microseconds of MD on that
machine, so it is a little disconcerting that people are seeing this. That
said, there are a few things I ALWAYS run with on Kraken.
1) I compile with -DNO_NTT3_SYNC.
This means that if I run with ntt=3 the random number stream is not
synchronized between threads; each thread uses its own random number stream.
2) I rarely run production MD with ntt=3; I prefer to equilibrate with it and
then run ntt=1 with a long coupling constant, something like tautp=10.0,
figuring that the system should already be well equilibrated, so use of a
weak Berendsen thermostat is not too bad (a minimal example input along these
lines is sketched after this list).
3) When I do run ntt=3 I rarely use gamma_ln > 2.0, since larger values can
make the system very viscous.
4) I typically do not run more than 10ns in a single simulation block.
5) I usually only run with cut=8.0
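For concreteness, here is roughly what the production settings in points 2),
4) and 5) look like as an input file. This is only a minimal sketch, not a
complete mdin; the restart, output and SHAKE options are placeholders and
should match whatever you are already using, and the inline '!' annotations
are just notes that can be stripped if your namelist reader objects to them:

 &cntrl
  imin=0, irest=1, ntx=5,              ! restart from equilibrated coordinates
  nstlim=5000000, dt=0.002,            ! 10 ns per block (point 4)
  ntt=1, temp0=310.0, tautp=10.0,      ! weak Berendsen thermostat (point 2)
  ntb=2, ntp=1, taup=1.0,              ! constant pressure
  ntc=2, ntf=2, cut=8.0,               ! SHAKE on bonds to H, 8 Angstrom cutoff (point 5)
  ntpr=1000, ntwx=1000, ntwr=10000, ioutfm=1,
 /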
So my main questions would be: do you ever see this problem occur if you set
ntt=1? My initial hunch is that it is related to ntt=3. What happens if you
recompile with -DNO_NTT3_SYNC added in the config.h file? If you then run with
ntt=3, does the problem still occur? (A rough sketch of what I mean by the
recompile follows below.)
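To be clear about the recompile: add the define to the preprocessor flags in
$AMBERHOME/src/config.h and rebuild pmemd. The exact flags variable in
config.h depends on your platform and configure options, so the lines below
are only an illustration; check your own config.h for the actual variable:

  cd $AMBERHOME/src
  # append -DNO_NTT3_SYNC to the existing preprocessor flags line in config.h,
  # for example (variable name is illustrative, yours may differ):
  #   FPPFLAGS= ... -DNO_NTT3_SYNC
  make clean
  make parallel     # or however you normally rebuild pmemd.MPI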
What about using a smaller cutoff? Do you still see the problem then?
Is there anything in the previous steps that would lead you to suspect there
is something wrong with your simulation?
Additionally, do these problems ONLY occur with ntb=2, or also at constant
volume (ntb=1)?
It would be useful if someone who has the time could look into this properly
and do a systematic, exhaustive study of the problem: work out exactly which
combinations of settings it occurs with and which it does not (something along
the lines of the loop sketched below). Without this information such a
heisenbug is going to be impossible to track down.
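It is only a rough sketch: the topology, restart and template file names
(test.prmtop, equil.rst, md_template.in) are made up, and for the constant
volume runs ntp would of course also have to be switched off in the template:

  #!/bin/sh
  # rough sketch of an exhaustive test matrix; file names are placeholders
  for ntt in 1 3; do
    for ntb in 1 2; do
      for cut in 8.0 10.0; do
        run=ntt${ntt}_ntb${ntb}_cut${cut}
        # md_template.in carries NTT/NTB/CUT tokens that get substituted here
        sed -e "s/NTT/${ntt}/" -e "s/NTB/${ntb}/" -e "s/CUT/${cut}/" \
            md_template.in > ${run}.in
        mpirun -np 256 pmemd.MPI -O -i ${run}.in -p test.prmtop -c equil.rst \
               -o ${run}.out -r ${run}.rst -x ${run}.nc
      done
    done
  done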
My suspicion right now is that there may be an issue in the random number
generator. This gets "abused" when running with ntt=3, so it would be the
prime suspect if runs with ntt=1 do not show the problem.
Of course there is always the argument of dodgy hardware, cosmic rays, etc.,
but that is such an easy escape that I would rather exhaust all other options
first.
Thanks for the help and feedback.
All the best
Ross
> -----Original Message-----
> From: Bill Miller III [mailto:brmilleriii.gmail.com]
> Sent: Tuesday, February 15, 2011 4:13 AM
> To: AMBER Mailing List
> Subject: Re: [AMBER] max pairlist cutoff error on octahedral box
>
> I have also seen this error randomly on a system I am running using regular
> pmemd (i.e. not the GPU version) using Amber 11 on 256 processors on
> Athena.
> I have seen the error on four different systems I have been running. The
> systems are all fairly large (up to 350,000 atoms). The error never occurs
> at the same place twice. However, the error has occurred more frequently for
> me than Bongkeun. It sometimes does not happen for several nanoseconds, but
> can also happen many times per nanosecond. The error:
>
> | ERROR: max pairlist cutoff must be less than unit cell max sphere radius!
>
> is written once at the very end of the mdout file, and a couple of hundred
> times in the STDOUT file. We also turned verbose on to see what happened
> with the forces right before the error. The last step showed NaN for
> essentially all energy values, followed of course by the error message.
>
> NET FORCE PER ATOM: 0.1203E-05 0.6295E-06 0.2283E-05
>
> Evdw = -1296.057283476214
> Ehbond = 0.000000000000
> Ecoulomb = -220472.976904191100
>
>
> Iso virial = 199523.479901402000
> Eevir vs. Ecoulomb = 5.356456894359
> a,b,c,volume now equal to 168.045 168.045 168.045 3653026.946
> NET FORCE PER ATOM: NaN NaN NaN
>
> Evdw = -1296.057285985989
> Ehbond = 0.000000000000
> Ecoulomb = NaN
>
>
> Iso virial = NaN
> Eevir vs. Ecoulomb = 0.000000000000
> a,b,c,volume now equal to NaN NaN NaN NaN
> | ERROR: max pairlist cutoff must be less than unit cell max sphere radius!
>
> This shows that, as expected, the system is blowing up right before the job
> dies. Below is the pmemd mdin file used for this simulation (with verbose
> turned off here, obviously).
>
> MD Run in pmemd.
> &cntrl
> nstlim=250000, owtnm='O', hwtnm1='H1',
> dielc=1, nrespa=1, temp0=310,
> tol=1e-05, vlimit=20, iwrap=1, ntc=2,
> ig=-1, pres0=1, ntb=2, ntrx=1,
> ibelly=0, nmropt=0, hwtnm2='H2',
> imin=0, ntxo=1, watnam='WAT', igb=0,
> comp=44.6, jfastw=0, ntx=5, ipol=0,
> nscm=1000, ntp=1, tempi=0, ntr=0,
> ntt=3, ntwr=1000, cut=10, ntave=0,
> dt=0.002, ntwx=1000, ntf=2, irest=1,
> ntpr=100, taup=1, gamma_ln=5,
> ioutfm=1,
> /
> &ewald
> verbose=0, ew_type=0, eedtbdns=500,
> netfrc=1, dsum_tol=1e-05, skinnb=2,
> rsum_tol=5e-05, nbtell=0, nbflag=1,
> frameon=1, vdwmeth=1, order=4, eedmeth=1,
> /
>
> I am using cut=10 here, but I have also tried cut=8 with the same results.
>
> I hope all this helps pinpoint the source of the problem. Let me know if you
> have any questions or if you have any suggestions.
>
> -Bill
>
>
> On Mon, Feb 14, 2011 at 4:57 PM, Bongkeun Kim <bkim.chem.ucsb.edu>
> wrote:
>
> > Hello Ross,
> >
> > I posted my answers between the lines.
> >
> > Quoting Ross Walker <ross.rosswalker.co.uk>:
> >
> > > Hi Bongkeun,
> > >
> > > Unfortunately it is going to be hard to figure out what is going on here
> > > without doing some more digging. The error you see is somewhat misleading
> > > since it is effectively what happens if your system blows up. Some atom gets
> > > a huge force on it etc etc. There are a number of things that can cause this
> > > including everything from a bug in the code, issues with force field
> > > parameters and even flakey hardware. Can you check a few things for me.
> > >
> > > 1) Verify you definitely have bugfix.12 applied. Your output file should
> > > say:
> > >
> > > |--------------------- INFORMATION ----------------------
> > > | GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.
> > > | Version 2.1
> > > | 12/20/2010
> > >
> > Yes, it is from bugfix.12
> >
> > > 2) Verify that you can reproduce this error if you start this calculation
> > > again on the same hardware. Does it always occur at the same point.
> > >
> > No, I got this error randomly.
> >
> > > 3) Confirm exactly what hardware you are using. If this is NOT a C20XX
> > > series board then the chance of it being flakey hardware are much higher.
> > >
> > It's from C1070 family
> >
> > > 4) Finally try setting NTPR=1 and rerunning the calculation to see if it
> > > crashes at the same place. That way we will be able to see exactly what
> > > happened before the error was triggered.
> > >
> > I have not seen any error when using NTPR=1. This error came up about once
> > per 100 ns, at random. I assumed that heating of the GPU might cause this
> > error, so I split the runs into 10 ns blocks and allowed 5 min of idling to
> > cool down the GPUs. Each run takes about 10 hours.
> > Thanks.
> > Bongkeun Kim
> >
> > > Thanks,
> > >
> > > All the best
> > > Ross
> > >
> > >> -----Original Message-----
> > >> From: Bongkeun Kim [mailto:bkim.chem.ucsb.edu]
> > >> Sent: Monday, February 14, 2011 11:12 AM
> > >> To: amber
> > >> Subject: [AMBER] max pairlist cutoff error on octahedral box
> > >>
> > >> Hello,
> > >>
> > >> I got the following error message when I run on AMBER 11 GPU.
> > >> -------------------------------------------------------------
> > >> NSTEP = 420000  TIME(PS) = 155540.000  TEMP(K) = 312.04  PRESS = -187.3
> > >> Etot = -18757.4114  EKtot = 4575.5386  EPtot = -23332.9500
> > >> BOND = 58.8446  ANGLE = 136.7403  DIHED = 166.0070
> > >> 1-4 NB = 56.7262  1-4 EEL = -31.1536  VDWAALS = 3080.9761
> > >> EELEC = -26801.0907  EHBOND = 0.0000  RESTRAINT = 0.0000
> > >> EKCMT = 2199.9567  VIRIAL = 2501.7076  VOLUME = 74625.8314
> > >> Density = 0.9841
> > >>
> > >>
> > >> ------------------------------------------------------------------------------
> > >> | ERROR: max pairlist cutoff must be less than unit cell max sphere radius!
> > >> -----------------------------------------------------------------
> > >>
> > >> This error occurred randomly and once I used the last rst file I can
> > >> continue running. I already applied bugfix 12 and I used cutoff=8
> > >> Please let me know how to avoid this error.
> > >> Thank you.
> > >> Bongkeun Kim
> > >>
> > >>
> > >>
> > >> _______________________________________________
> > >> AMBER mailing list
> > >> AMBER.ambermd.org
> > >> http://lists.ambermd.org/mailman/listinfo/amber
> > >
> > >
> > > _______________________________________________
> > > AMBER mailing list
> > > AMBER.ambermd.org
> > > http://lists.ambermd.org/mailman/listinfo/amber
> > >
> >
> >
> >
> >
> >
> > _______________________________________________
> > AMBER mailing list
> > AMBER.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber
> >
>
>
>
> --
> Bill Miller III
> Quantum Theory Project,
> University of Florida
> Ph.D. Graduate Student
> 352-392-6715
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber