Dear Ian,
Thanks for your comments and suggestions. A few replies:
*Do you get error 3 "ERROR: max pairlist cutoff must be less than unit
cell max sphere radius!" right at the start of a run or during the
simulation? If it's at the start, we have definitely seen this, but never
during a run; so if it is the former then again this will be related to 1)
and 2), but if this occurs during the run then something really odd is
happening.*
We are indeed getting this error during the simulations. Checking just a
few of the times we have seen it, we get max pairlist errors at steps
318000, 858000, and 749000 in various windows. I have pasted an example
just below.
NSTEP = 858000  TIME(PS) = 1716.000  TEMP(K) = 308.66  PRESS = 76.6
Etot = -53744.4777  EKtot = 24145.6465  EPtot = -77890.1242
BOND = 2038.4584  ANGLE = 8602.8268  DIHED = 5256.6803
1-4 NB = 1824.2334  1-4 EEL = 8901.8528  VDWAALS = 2646.5133
EELEC = -107160.9286  EHBOND = 0.0000  RESTRAINT = 0.2393
EAMBER (non-restraint) = -77890.3635
EKCMT = 6372.1802  VIRIAL = 5789.1127  VOLUME = 352390.6015
Density = 1.0053
------------------------------------------------------------------------------
NMR restraints: Bond = 0.239  Angle = 0.000  Torsion = 0.000
===============================================================================
| ERROR: max pairlist cutoff must be less than unit cell max sphere radius!
*4) NaNs are always a bad sign that something has gone hideously wrong. Are
the NaNs in the electrostatics? vdW? Or bonded terms? If these are occurring
in the bonded terms then this again looks to me like it's SHAKE-related.*
The NaNs actually show up in the kinetic energy (and consequently in Etot
and Temp). I have pasted an example just below. The density starts to drop
when the NaNs occur, as this is when the system explodes; just before that
the density is fine (I have pasted a series of density values below as
well).
NSTEP = 4000  TIME(PS) = 8.000  TEMP(K) = NaN  PRESS = 1055.2
Etot = NaN  EKtot = NaN  EPtot = -76287.2741
BOND = 2141.8520  ANGLE = 8462.7845  DIHED = 5315.4360
1-4 NB = 1790.4408  1-4 EEL = 8768.4739  VDWAALS = 2190.5317
EELEC = -104957.3664  EHBOND = 0.0000  RESTRAINT = 0.5734
EAMBER (non-restraint) = -76287.8475
EKCMT = 4761.7586  VIRIAL = -4018.8030  VOLUME = 385407.6003
Density = 0.9195
Density = 1.0070
Density = 1.0259
Density = 1.0172
Density = 1.0127
Density = 0.9195
Density = 0.8502
Density = 0.7922
Density = 0.7407
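(These density values were pulled from the mdout with a quick grep; the
file name "prod.out" below is a placeholder for our actual output file:)

    grep "Density" prod.out | awk '{print $3}'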
*2) What timestep and SHAKE parameters are you using? If you are using HMR
I'd immediately drop the timestep to 2 fs or even 1 fs to see if this is
the issue.*
We are using a 2 fs time step with ntc=ntf=2 in our simulations. (I've
actually had very good results with HMR in other, unrelated simulations.)
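For concreteness, the time-step-related portion of our mdin looks like the
sketch below. dt, ntc, and ntf are as stated above; the other keywords are
standard &cntrl settings shown only for context, not necessarily our exact
values:

    &cntrl
      imin = 0, irest = 1, ntx = 5,   ! continue from a restart with velocities
      dt = 0.002,                     ! 2 fs time step
      ntc = 2, ntf = 2,               ! SHAKE on bonds involving hydrogen
    &end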
For the periodic box errors, we are getting them at varying positions in
the trajectory (again, after an okay first 50 ns), for example at steps
1727000, 1490000, and 582000 in a few windows. I've attached a PDF image of
the box edge lengths in a simulation that fails with the PBC error, to show
that there does not seem to be any drastic fluctuation in the dimensions
near where the simulation stops.
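(For anyone who wants to reproduce this check: the box lengths can be
pulled from a trajectory with a short cpptraj input along these lines; the
topology and trajectory file names here are placeholders:)

    parm system.prmtop
    trajin window.nc
    vector boxlen box out box_lengths.dat
    run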
*3) What pressure coupling algorithm are you using? My group always uses
Berendsen when running these types of simulations, as we have never tried
it with the MC barostat; in principle it should not make any difference,
but you never know.*
We are using the Berendsen pressure coupling algorithm as well.
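In mdin terms that corresponds to something like the fragment below.
barostat = 1 selects Berendsen in pmemd (barostat = 2 would be Monte
Carlo); ntp = 2 (anisotropic scaling) is my assumption for a membrane
system, and the taup value is illustrative:

    &cntrl
      ntb = 2, ntp = 2,   ! constant pressure, anisotropic box scaling (membrane)
      barostat = 1,       ! Berendsen barostat (barostat = 2 is Monte Carlo)
      taup = 1.0,         ! pressure relaxation time in ps
    &end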
*1) Run the simulation for 100 ns in one go. From your e-mail you have cut
down the length of the simulations; by going in the opposite direction, if
you still get the errors then it removes the possibility that it is a
restart issue. If you can run multiple 100 ns stretches without this error
then it would be a strong indication that the problem is related to
restarts and may be a lot easier to diagnose.*
Thanks for this suggestion. We are trying this now with a few of the
windows that have failed on restart, to see if that makes a difference.
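Concretely, the test is a single continuous run per window, along the
lines of the sketch below (file names are placeholders; nstlim = 50000000
in the mdin gives 100 ns at a 2 fs step):

    pmemd.cuda -O -i md_100ns.in -p system.prmtop -c window_50ns.rst7 \
               -o md_100ns.out -r md_100ns.rst7 -x md_100ns.nc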
Kind regards,
Joe
------
Joseph Baker, PhD
Assistant Professor
Department of Chemistry
C101 Science Complex
The College of New Jersey
Ewing, NJ 08628
Phone: (609) 771-3173
Web: http://bakerj.pages.tcnj.edu/
On Sun, Apr 29, 2018 at 5:28 AM, Gould, Ian R <i.gould.imperial.ac.uk>
wrote:
> Dear Joe,
>
> We have been routinely running umbrella sampling simulations on GPUs of
> the kind you describe and have not experienced these sorts of problems in
> a very long time.
> To address the errors you have posted:
>
> (1) Reason: cudaMemcpy GpuBuffer::Download failed an illegal memory access
> was encountered
>
> When the cuda code bombs out with this type of error we have found, at
> least in my group, that these are usually related to errors in SHAKE,
> though due to the cryptic nature of how the cuda code exits it can be very
> difficult to ascribe a correct reason for the error.
>
> (2) ERROR: Calculation halted. Periodic box dimensions have changed too
> much from their initial values.
>
> This error and your error 3 do tend to suggest that the system is far from
> equilibrium; I know in your e-mail you say that you have run the systems
> for 50 ns to reach equilibration.
>
> Do you get error 3 "ERROR: max pairlist cutoff must be less than unit
> cell max sphere radius!" right at the start of a run or during the
> simulation? If it's at the start, we have definitely seen this, but never
> during a run; so if it is the former then again this will be related to 1)
> and 2), but if this occurs during the run then something really odd is
> happening.
>
> 4) NaNs are always a bad sign that something has gone hideously wrong. Are
> the NaNs in the electrostatics? vdW? Or bonded terms? If these are
> occurring in the bonded terms then this again looks to me like it's
> SHAKE-related.
>
> Potentially a big thing in your e-mail is that this is occurring only on
> restarts. A long time ago there was an issue with pmemd.cuda which would
> cause failures of this kind upon restarts (Scott and Ross, if you are
> reading this, could you pitch in here?), but that was resolved way back in
> Amber12, to the best of my recollection.
>
> Some thoughts on things to try
> 1) Run the simulation for 100 ns in one go. From your e-mail you have cut
> down the length of the simulations; by going in the opposite direction, if
> you still get the errors then it removes the possibility that it is a
> restart issue. If you can run multiple 100 ns stretches without this error
> then it would be a strong indication that the problem is related to
> restarts and may be a lot easier to diagnose.
>
> 2) What timestep and SHAKE parameters are you using? If you are using HMR
> I'd immediately drop the timestep to 2 fs or even 1 fs to see if this is
> the issue.
>
> 3) What pressure coupling algorithm are you using? My group always uses
> Berendsen when running these types of simulations, as we have never tried
> it with the MC barostat; in principle it should not make any difference,
> but you never know.
>
> HTH
> Ian
>
> Tyrell: I'm surprised you didn't come here sooner.
> Roy: It's not an easy thing to meet your maker.
> Tyrell: What could he do for you?
> Roy: Can the maker repair what he makes?
> Blade Runner
>
> --
>
> Professor Ian R Gould, FRSC.
> Professor of Computational Chemical Biology
> Department of Chemistry
> Imperial College London
> Exhibition Road
> London
> SW7 2AY
>
> E-mail i.gould.imperial.ac.uk
> http://www3.imperial.ac.uk/people/i.gould
> Tel +44 (0)207 594 5809
>
>
> On 28/04/2018, 19:55, "Baker, Joseph" <bakerj.tcnj.edu> wrote:
>
> Hi all,
>
> We are trying to run some umbrella simulations with a small molecule
> restrained in the z-direction in a number of windows with the molecule
> moving through a POPE membrane (lipid14) using Amber16 pmemd.cuda. We
> are
> encountering a number of errors in some windows (not all) that include
> the
> following:
>
> (1) Reason: cudaMemcpy GpuBuffer::Download failed an illegal memory
> access
> was encountered
>
> (2) ERROR: Calculation halted. Periodic box dimensions have changed
> too
> much from their initial values.
>
> (3) ERROR: max pairlist cutoff must be less than unit cell max sphere
> radius!
>
> (4) And occasionally NaN showing up for various energy terms in the
> output
> log file, in which case the system keeps running, but when we view it
> in
> several windows the system has completely "exploded".
>
> The strange thing (to me) is that each window has already been run for
> 50
> ns with no problems on the GPU (suggesting they are equilibrated), and
> when
> looking at the systems it does not appear there are any large
> fluctuations
> of box size at the point that failures are occurring. Also, windows
> that
> fail do not look very different compared to windows that continue to
> run
> okay in the second 50 ns (aside from the ones that "explode" with NaN
> errors).
>
> Our collaborator at another site has seen the same errors when running
> our
> system, and has also seen the same errors for their own system of a
> different small molecule moving through the POPE membrane. In their
> case,
> they ran their first 50 ns of each window on the CPU (pmemd.MPI no
> failures), and then when they switched to GPUs they started to see the
> failures in the second 50 ns.
> I should also add that at our site we have spot-checked one of the
> failing
> windows by continuing it on the CPU instead of the GPU for the 2nd 50
> ns,
> and that works fine as well. So it appears that problems arise in only
> some
> windows and only when trying to run the second 50 ns of these
> simulations
> on a GPU device.
>
> We have tried a number of solutions (running shorter simulations to
> restart
> more frequently to attempt to fix the periodic box type errors,
> turning off
> the umbrella restraints to see if that was the problem, etc.), but
> have not
> been able to resolve these issues, and are at a bit of a loss for what
> might be going on in our case.
>
> Any advice, suggestions for tests, etc. would be greatly appreciated to
> track down what might be going on when trying to extend these systems
> on
> the GPU! Thanks!
>
> Kind regards,
> Joe
>
> ------
> Joseph Baker, PhD
> Assistant Professor
> Department of Chemistry
> C101 Science Complex
> The College of New Jersey
> Ewing, NJ 08628
> Phone: (609) 771-3173
> Web: http://bakerj.pages.tcnj.edu/
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Sun Apr 29 2018 - 12:00:05 PDT