So these are:
1) All Langevin simulations?
2) ig=-1 so no two simulations are the same?
3) Everything is load-balanced so it can never ever ever be the same between two runs?
In this scenario, it's not possible to isolate the cause IMO. You need to either run on the same number of GPUs every time with the random seed set explicitly until you hit a failure that reproduces, or run as a single process on the CPU.
For me to fix things at the GPU end if you go that route, I need an *explicit* repro with everything explicitly set on a single GPU. It'll take all of 20 minutes to figure it out with such a simulation (I just checked in the fix for a repro sent to me at 11:30 last night, for example). Without it, it's next to impossible.
Scott
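For anyone preparing such a repro, here is a minimal sketch (not part of Scott's message) of pinning the seed in an existing mdin file. The helper name pin_seed and the seed value 12345 are made up for illustration; the only thing taken from the thread is that ig=-1 makes every run different, so it has to be replaced by a fixed non-negative value.

```python
import re

def pin_seed(mdin_text: str, seed: int = 12345) -> str:
    """Return mdin_text with every ig=<value> assignment replaced by ig=<seed>.

    Hypothetical helper: the seed value is arbitrary, it only has to be
    fixed and non-negative so that reruns draw the same random numbers.
    """
    if seed < 0:
        raise ValueError("seed must be non-negative to get a repeatable run")
    # \b keeps us from touching other keywords such as igb=0.
    pattern = re.compile(r"\big\s*=\s*-?\d+", re.IGNORECASE)
    new_text, count = pattern.subn(f"ig={seed}", mdin_text)
    if count == 0:
        raise ValueError("no ig= assignment found in the input")
    return new_text

if __name__ == "__main__":
    line = "ig=-1, pres0=1, ntb=2, ntrx=1, igb=0,"
    print(pin_seed(line))
```

A fixed seed alone does not make a load-balanced multi-process run repeatable; it only removes one of the three sources of variation listed above.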
-----Original Message-----
From: Bill Miller III [mailto:brmilleriii.gmail.com]
Sent: Tuesday, February 15, 2011 04:13
To: AMBER Mailing List
Subject: Re: [AMBER] max pairlist cutoff error on octahedral box
I have also seen this error randomly on a system I am running using regular
pmemd (i.e. not the GPU version) using Amber 11 on 256 processors on Athena.
I have seen the error on four different systems I have been running. The
systems are all fairly large (up to 350,000 atoms). The error never occurs
at the same place twice. However, the error has occurred more frequently for
me than for Bongkeun. It sometimes does not happen for several nanoseconds, but
can also happen many times per nanosecond. The error:
| ERROR: max pairlist cutoff must be less than unit cell max sphere radius!
is written once at the very end of the mdout file, and a couple of hundred
times in the STDOUT file. We also turned verbose on to see what happened
with the forces right before the error. The last step showed NaN for
essentially all energy values, followed of course by the error message.
NET FORCE PER ATOM: 0.1203E-05 0.6295E-06 0.2283E-05
Evdw = -1296.057283476214
Ehbond = 0.000000000000
Ecoulomb = -220472.976904191100
Iso virial = 199523.479901402000
Eevir vs. Ecoulomb = 5.356456894359
a,b,c,volume now equal to 168.045 168.045 168.045 3653026.946
NET FORCE PER ATOM: NaN NaN NaN
Evdw = -1296.057285985989
Ehbond = 0.000000000000
Ecoulomb = NaN
Iso virial = NaN
Eevir vs. Ecoulomb = 0.000000000000
a,b,c,volume now equal to NaN NaN NaN NaN
| ERROR: max pairlist cutoff must be less than unit cell max sphere radius!
This shows that, as expected, the system is blowing up right before the job
dies. Below is the pmemd mdin file used for this simulation (with verbose
turned off here, obviously).
MD Run in pmemd.
&cntrl
nstlim=250000, owtnm='O', hwtnm1='H1',
dielc=1, nrespa=1, temp0=310,
tol=1e-05, vlimit=20, iwrap=1, ntc=2,
ig=-1, pres0=1, ntb=2, ntrx=1,
ibelly=0, nmropt=0, hwtnm2='H2',
imin=0, ntxo=1, watnam='WAT', igb=0,
comp=44.6, jfastw=0, ntx=5, ipol=0,
nscm=1000, ntp=1, tempi=0, ntr=0,
ntt=3, ntwr=1000, cut=10, ntave=0,
dt=0.002, ntwx=1000, ntf=2, irest=1,
ntpr=100, taup=1, gamma_ln=5,
ioutfm=1,
/
&ewald
verbose=0, ew_type=0, eedtbdns=500,
netfrc=1, dsum_tol=1e-05, skinnb=2,
rsum_tol=5e-05, nbtell=0, nbflag=1,
frameon=1, vdwmeth=1, order=4, eedmeth=1,
/
I am using cut=10 here, but I have also tried cut=8 with the same results.
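As a rough sanity check on why changing cut makes no difference, the sketch below compares cut + skinnb with the largest sphere that fits in the last healthy box (a = b = c = 168.045). Several things here are assumptions: the box angles are not printed in the output above, so the truncated-octahedron angle acos(-1/3) is taken from the subject line; treating the "max pairlist cutoff" as cut + skinnb and the "max sphere radius" as half the smallest face-to-face distance is my reading of the error text, not something confirmed from the pmemd source.

```python
import math

def triclinic_volume(a, b, c, alpha, beta, gamma):
    """Volume of a triclinic cell; lengths in Angstrom, angles in degrees."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1.0 - ca*ca - cb*cb - cg*cg + 2.0*ca*cb*cg)

def inscribed_radius(a, b, c, alpha, beta, gamma):
    """Half of the smallest perpendicular distance between opposite faces."""
    vol = triclinic_volume(a, b, c, alpha, beta, gamma)
    sa, sb, sg = (math.sin(math.radians(x)) for x in (alpha, beta, gamma))
    # alpha lies between b and c, beta between a and c, gamma between a and b.
    widths = (vol / (b * c * sa), vol / (a * c * sb), vol / (a * b * sg))
    return 0.5 * min(widths)

# Assumed truncated-octahedron angle (about 109.47 degrees).
OCT_ANGLE = math.degrees(math.acos(-1.0 / 3.0))
box = (168.045, 168.045, 168.045, OCT_ANGLE, OCT_ANGLE, OCT_ANGLE)

volume = triclinic_volume(*box)
radius = inscribed_radius(*box)
pairlist_cutoff = 10.0 + 2.0  # cut + skinnb from the mdin above (assumed definition)

print(f"volume = {volume:.1f} A^3, sphere radius = {radius:.2f} A, "
      f"cut + skinnb = {pairlist_cutoff:.1f} A")
```

The computed volume agrees with the 3653026.946 reported in the verbose output to within the rounding of the box length, which supports the octahedral assumption, and the radius comes out near 68 A. Under these assumptions a 12 A list cutoff is nowhere near the limit, so the check can only trip once the box dimensions themselves have gone bad (NaN in the output above).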
I hope all this helps pinpoint the source of the problem. Let me know if you
have any questions or if you have any suggestions.
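In case it is useful to anyone chasing the same thing, here is a small sketch for finding the first verbose block that contains NaN in a long output file, so the step where things go wrong can be located without scrolling. It assumes each block starts with a "NET FORCE PER ATOM" line, as in the output quoted above; the function name first_nan_block is made up, and the layout of other pmemd versions may differ.

```python
def first_nan_block(lines):
    """Return (index, block) for the first verbose block containing NaN.

    A block is assumed to start at a line beginning with
    'NET FORCE PER ATOM' and to run until the next such line.
    Returns None if no block contains a NaN field.
    """
    blocks = []
    for line in lines:
        if line.lstrip().startswith("NET FORCE PER ATOM"):
            blocks.append([])
        if blocks:  # ignore anything before the first block marker
            blocks[-1].append(line.rstrip("\n"))
    for index, block in enumerate(blocks):
        for text in block:
            # Compare whole whitespace-separated fields, case-insensitively.
            if any(field.lower() == "nan" for field in text.split()):
                return index, block
    return None

if __name__ == "__main__":
    sample = [
        "NET FORCE PER ATOM: 0.1203E-05 0.6295E-06 0.2283E-05",
        "Ecoulomb = -220472.976904191100",
        "NET FORCE PER ATOM: NaN NaN NaN",
        "Ecoulomb = NaN",
    ]
    hit = first_nan_block(sample)
    if hit is not None:
        print("first NaN in block", hit[0])
```

The block just before the one returned is the last healthy step, which is the one worth comparing between a failing run and a clean run.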
-Bill
On Mon, Feb 14, 2011 at 4:57 PM, Bongkeun Kim <bkim.chem.ucsb.edu> wrote:
> Hello Ross,
>
> I posted my answers between the lines.
>
> Quoting Ross Walker <ross.rosswalker.co.uk>:
>
> > Hi Bongkeun,
> >
> > Unfortunately it is going to be hard to figure out what is going on here
> > without doing some more digging. The error you see is somewhat misleading
> > since it is effectively what happens if your system blows up. Some atom gets
> > a huge force on it etc etc. There are a number of things that can cause this
> > including everything from a bug in the code, issues with force field
> > parameters and even flakey hardware. Can you check a few things for me.
> >
> > 1) Verify you definitely have bugfix.12 applied. Your output file should
> > say:
> >
> > |--------------------- INFORMATION ----------------------
> > | GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.
> > | Version 2.1
> > | 12/20/2010
> >
> Yes, it is from bugfix.12
>
> > 2) Verify that you can reproduce this error if you start this calculation
> > again on the same hardware. Does it always occur at the same point?
> >
> No, I got this error randomly.
>
> > 3) Confirm exactly what hardware you are using. If this is NOT a C20XX
> > series board then the chance of it being flakey hardware is much higher.
> >
> It's from C1070 family
>
> > 4) Finally try setting NTPR=1 and rerunning the calculation to see if it
> > crashes at the same place. That way we will be able to see exactly what
> > happened before the error was triggered.
> >
> I do not see any error when using NTPR=1. This error comes up about once
> in 100ns, randomly. I assume that heating of the GPU may cause this error,
> so I split the run into 10ns segments and allow 5 min of idling between
> them to cool down the GPUs. Each segment takes about 10 hours.
> Thanks.
> Bongkeun Kim
>
> > Thanks,
> >
> > All the best
> > Ross
> >
> >> -----Original Message-----
> >> From: Bongkeun Kim [mailto:bkim.chem.ucsb.edu]
> >> Sent: Monday, February 14, 2011 11:12 AM
> >> To: amber
> >> Subject: [AMBER] max pairlist cutoff error on octahedral box
> >>
> >> Hello,
> >>
> >> I got the following error message when I run on AMBER 11 GPU.
> >> -------------------------------------------------------------
> >> NSTEP = 420000 TIME(PS) = 155540.000 TEMP(K) = 312.04 PRESS = -187.3
> >> Etot = -18757.4114 EKtot = 4575.5386 EPtot = -23332.9500
> >> BOND = 58.8446 ANGLE = 136.7403 DIHED = 166.0070
> >> 1-4 NB = 56.7262 1-4 EEL = -31.1536 VDWAALS = 3080.9761
> >> EELEC = -26801.0907 EHBOND = 0.0000 RESTRAINT = 0.0000
> >> EKCMT = 2199.9567 VIRIAL = 2501.7076 VOLUME = 74625.8314
> >> Density = 0.9841
> >> ------------------------------------------------------------------------------
> >>
> >> | ERROR: max pairlist cutoff must be less than unit cell max sphere radius!
> >> -----------------------------------------------------------------
> >>
> >> This error occurred randomly and once I used the last rst file I can
> >> continue running. I already applied bugfix 12 and I used cutoff=8
> >> Please let me know how to avoid this error.
> >> Thank you.
> >> Bongkeun Kim
> >>
> >>
> >>
> >> _______________________________________________
> >> AMBER mailing list
> >> AMBER.ambermd.org
> >> http://lists.ambermd.org/mailman/listinfo/amber
--
Bill Miller III
Quantum Theory Project,
University of Florida
Ph.D. Graduate Student
352-392-6715
Received on Tue Feb 15 2011 - 08:30:02 PST