Ross,
Thanks for your comments and responses. You can see my answers below.
-Billy
On Tue, Feb 15, 2011 at 3:50 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
> Hi Bill,
>
> It is very strange that this happens on the 'C'PU. I have NEVER seen such a
> thing on Kraken despite running many tens of microseconds of MD on that
> machine, so it is a little disconcerting that people are seeing this. That
> said, there are a few things I ALWAYS run with on Kraken.
>
> 1) I compile with -DNO_NTT3_SYNC
>
> This means that if I run ntt=3 the random number stream is not synchronized
> between threads; each thread uses its own random number stream.
>
> 2) I rarely run production MD with ntt=3; I prefer to equilibrate with it
> and then run ntt=1 with a long coupling constant, something like tautp=10.0,
> figuring that the system should already be well equilibrated so use of a
> weak Berendsen thermostat is not too bad.
>
> 3) When I do run ntt=3 I rarely use gamma_ln > 2.0 or the system can become
> very viscous.
>
This doesn't seem to be a problem for our system. Our protein is still very
dynamic in the expected locations.
>
> 4) I typically do not run more than 10ns in a single simulation block.
>
We are actually running in 0.5 ns blocks. It was originally 1.0 ns blocks,
but I got frustrated with the errors so I cut it down to 0.5 ns blocks to
increase my chances of finishing the chunk without errors. :)
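In case the details matter, the chunks are just chained restarts in a job
script, roughly like the sketch below (the file names, chunk count, and the
aprun line are placeholders, not our exact script):

  i=1
  while [ "$i" -le 20 ]; do
      prev=$((i - 1))
      # each 0.5 ns chunk restarts from the previous chunk's restart file
      aprun -n 256 pmemd.MPI -O -i md.in -p prmtop \
            -c md${prev}.rst -r md${i}.rst -x md${i}.nc -o md${i}.out
      i=$((i + 1))
  done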
>
> 5) I usually only run with cut=8.0
>
> So my main questions would be: do you ever see this problem occur if you set
> ntt=1? My initial hunch is that it is related to ntt=3. What happens if you
> recompile with -DNO_NTT3_SYNC in the config.h file and then run with ntt=3?
> Does the problem still occur?
>
We already compile Amber with the -DNO_NTT3_SYNC flag in the config.h file
on Kraken and Athena. So yes, we do see the error under these circumstances.
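For what it's worth, the change on our end amounts to appending the flag to
the pmemd preprocessor flags in $AMBERHOME/src/config.h and rebuilding,
roughly like this (the exact variable name and make target depend on the
Amber version, so treat this as a sketch rather than our verbatim files):

  # in $AMBERHOME/src/config.h, add the flag to the Fortran preprocessor
  # flags, e.g.  FPPFLAGS = ... -DNO_NTT3_SYNC
  cd $AMBERHOME/src
  make clean
  make parallel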
>
> What about using a smaller cutoff? Do you ever see the problem?
>
I have also tried running with cut=8.0, but I get the same error; there is no
difference from using cut=10.
>
> Is there anything in the previous steps that would lead you to suspect
> there is something wrong with your simulation?
>
I have seen nothing to suggest that something is going wrong. Setting
verbose=1 showed nothing unusual except for the very last step, which had NaN
in place of the energy values. However, I am going to experiment with setting
ntwx=1 to see whether the trajectory shows anything going wrong. I have
hesitated to do this before because of the file size that would accumulate
with such a large system (~350,000 atoms).
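Back of the envelope, assuming the binary NetCDF output we use (ioutfm=1,
4 bytes per single-precision coordinate): 350,000 atoms x 3 coordinates x
4 bytes is roughly 4.2 MB per frame, and with ntwx=1 a 0.5 ns chunk of
250,000 steps writes about 250,000 frames, i.e. on the order of 1 TB per
chunk. So I will probably only turn this on for a short stretch around a
crash.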
>
> Additionally do these problems ONLY occur with ntb=2 or also with constant
> volume?
>
I have not tried with constant volume, so this is something I will look
into, as well.
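For that test my plan is simply to rerun the same input with the barostat
turned off, i.e. changing

  ntb=1, ntp=0,

in the &cntrl namelist and leaving everything else as it is (taup, pres0, and
comp are then ignored). Let me know if that is not the comparison you have in
mind.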
>
> It would be useful if someone who has the time to look into this properly
> could do a proper 'scientific-like' exhaustive study of the problem and work
> out exactly which combinations of settings this occurs with and which it
> doesn't. Without this info such a Heisenbug is going to be impossible to
> track down.
>
> My suspicion right now is that there may be an issue in the random number
> generator. This gets "abused" when running with ntt=3, so it would be the
> prime suspect if runs with ntt=1 do not cause issues.
>
> Of course there is always the argument of dodgy hardware, cosmic rays, etc.,
> but that is such an easy escape that I would rather exhaust all other
> options first.
>
> Thanks for the help and feedback.
>
> All the best
> Ross
>
> > -----Original Message-----
> > From: Bill Miller III [mailto:brmilleriii.gmail.com]
> > Sent: Tuesday, February 15, 2011 4:13 AM
> > To: AMBER Mailing List
> > Subject: Re: [AMBER] max pairlist cutoff error on octahedral box
> >
> > I have also seen this error randomly, with regular pmemd (i.e. not the
> > GPU version) from Amber 11 on 256 processors on Athena. I have seen the
> > error on four different systems I have been running. The systems are all
> > fairly large (up to 350,000 atoms). The error never occurs at the same
> > place twice. However, it has occurred more frequently for me than for
> > Bongkeun. It sometimes does not happen for several nanoseconds, but can
> > also happen many times per nanosecond. The error:
> >
> > | ERROR: max pairlist cutoff must be less than unit cell max sphere radius!
> >
> > is written once at the very end of the mdout file, and a couple of hundred
> > times in the STDOUT file. We also turned verbose on to see what happened
> > with the forces right before the error. The last step showed NaN for
> > essentially all energy values, followed of course by the error message.
> >
> > NET FORCE PER ATOM:   0.1203E-05   0.6295E-06   0.2283E-05
> >
> > Evdw               =    -1296.057283476214
> > Ehbond             =        0.000000000000
> > Ecoulomb           =  -220472.976904191100
> >
> > Iso virial         =   199523.479901402000
> > Eevir vs. Ecoulomb =        5.356456894359
> > a,b,c,volume now equal to  168.045  168.045  168.045  3653026.946
> >
> > NET FORCE PER ATOM:   NaN   NaN   NaN
> >
> > Evdw               =    -1296.057285985989
> > Ehbond             =        0.000000000000
> > Ecoulomb           =   NaN
> >
> > Iso virial         =   NaN
> > Eevir vs. Ecoulomb =        0.000000000000
> > a,b,c,volume now equal to  NaN  NaN  NaN  NaN
> >
> > | ERROR: max pairlist cutoff must be less than unit cell max sphere radius!
> >
> > This shows that, as expected, the system is blowing up right before the
> > job dies. Below is the pmemd mdin file used for this simulation (with
> > verbose turned off here, obviously).
> >
> > MD Run in pmemd.
> > &cntrl
> > nstlim=250000, owtnm='O', hwtnm1='H1',
> > dielc=1, nrespa=1, temp0=310,
> > tol=1e-05, vlimit=20, iwrap=1, ntc=2,
> > ig=-1, pres0=1, ntb=2, ntrx=1,
> > ibelly=0, nmropt=0, hwtnm2='H2',
> > imin=0, ntxo=1, watnam='WAT', igb=0,
> > comp=44.6, jfastw=0, ntx=5, ipol=0,
> > nscm=1000, ntp=1, tempi=0, ntr=0,
> > ntt=3, ntwr=1000, cut=10, ntave=0,
> > dt=0.002, ntwx=1000, ntf=2, irest=1,
> > ntpr=100, taup=1, gamma_ln=5,
> > ioutfm=1,
> > /
> > &ewald
> > verbose=0, ew_type=0, eedtbdns=500,
> > netfrc=1, dsum_tol=1e-05, skinnb=2,
> > rsum_tol=5e-05, nbtell=0, nbflag=1,
> > frameon=1, vdwmeth=1, order=4, eedmeth=1,
> > /
> >
> > I am using cut=10 here, but I have also tried cut=8 with the same results.
> >
> > I hope all this helps pinpoint the source of the problem. Let me know if
> > you have any questions or suggestions.
> >
> > -Bill
> >
> >
> > On Mon, Feb 14, 2011 at 4:57 PM, Bongkeun Kim <bkim.chem.ucsb.edu>
> > wrote:
> >
> > > Hello Ross,
> > >
> > > I posted my answers between the lines.
> > >
> > > Quoting Ross Walker <ross.rosswalker.co.uk>:
> > >
> > > > Hi Bongkeun,
> > > >
> > > > Unfortunately it is going to be hard to figure out what is going on
> > > > here without doing some more digging. The error you see is somewhat
> > > > misleading since it is effectively what happens if your system blows
> > > > up: some atom gets a huge force on it, etc. There are a number of
> > > > things that can cause this, including everything from a bug in the
> > > > code to issues with force field parameters and even flaky hardware.
> > > > Can you check a few things for me?
> > > >
> > > > 1) Verify you definitely have bugfix.12 applied. Your output file
> > > > should say:
> > > >
> > > > |--------------------- INFORMATION ----------------------
> > > > | GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.
> > > > | Version 2.1
> > > > | 12/20/2010
> > > >
> > > Yes, it is from bugfix.12
> > >
> > > > 2) Verify that you can reproduce this error if you start this
> > > > calculation again on the same hardware. Does it always occur at the
> > > > same point?
> > > >
> > > No, I got this error randomly.
> > >
> > > > 3) Confirm exactly what hardware you are using. If this is NOT a
> > > > C20XX series board then the chance of it being flaky hardware is much
> > > > higher.
> > > >
> > > It's from the C1070 family.
> > >
> > > > 4) Finally try setting NTPR=1 and rerunning the calculation to see if
> > > > it crashes at the same place. That way we will be able to see exactly
> > > > what happened before the error was triggered.
> > > >
> > > I cannot see any error when using NTPR=1. This error came up about once
> > > in 100 ns, at random. I assume that GPU heating may cause this error, so
> > > I split the runs into 10 ns chunks and allowed 5 minutes of idling to
> > > cool down the GPUs. Each run takes about 10 hours.
> > > Thanks.
> > > Bongkeun Kim
> > >
> > > > Thanks,
> > > >
> > > > All the best
> > > > Ross
> > > >
> > > >> -----Original Message-----
> > > >> From: Bongkeun Kim [mailto:bkim.chem.ucsb.edu]
> > > >> Sent: Monday, February 14, 2011 11:12 AM
> > > >> To: amber
> > > >> Subject: [AMBER] max pairlist cutoff error on octahedral box
> > > >>
> > > >> Hello,
> > > >>
> > > >> I got the following error message when I run on AMBER 11 GPU.
> > > >> -------------------------------------------------------------
> > > >>  NSTEP =   420000  TIME(PS) =  155540.000  TEMP(K) =  312.04  PRESS =  -187.3
> > > >>  Etot   =    -18757.4114  EKtot   =      4575.5386  EPtot      =    -23332.9500
> > > >>  BOND   =        58.8446  ANGLE   =       136.7403  DIHED      =       166.0070
> > > >>  1-4 NB =        56.7262  1-4 EEL =       -31.1536  VDWAALS    =      3080.9761
> > > >>  EELEC  =    -26801.0907  EHBOND  =         0.0000  RESTRAINT  =         0.0000
> > > >>  EKCMT  =      2199.9567  VIRIAL  =      2501.7076  VOLUME     =     74625.8314
> > > >>                                                     Density    =         0.9841
> > > >> ------------------------------------------------------------------------------
> > > >>
> > > >> | ERROR: max pairlist cutoff must be less than unit cell max sphere radius!
> > > >> -----------------------------------------------------------------
> > > >> This error occurred randomly, and once I used the last rst file I
> > > >> could continue running. I already applied bugfix 12 and I used
> > > >> cutoff=8. Please let me know how to avoid this error.
> > > >> Thank you.
> > > >> Bongkeun Kim
> > > >>
> > > >>
> > > >>
> > --
> > Bill Miller III
> > Quantum Theory Project,
> > University of Florida
> > Ph.D. Graduate Student
> > 352-392-6715
--
Bill Miller III
Quantum Theory Project,
University of Florida
Ph.D. Graduate Student
352-392-6715
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Feb 15 2011 - 14:00:02 PST