Dear Joe,
We have been routinely running umbrella sampling simulations on GPUs of the kind you describe and have not experienced these sorts of problems in a very long time.
To address the errors you have posted:
(1) Reason: cudaMemcpy GpuBuffer::Download failed an illegal memory access
was encountered
When the CUDA code bombs out with this type of error, we have found, at least in my group, that it is usually related to errors in SHAKE, though given the cryptic way the CUDA code exits it can be very difficult to ascribe a correct reason for the error.
(2) ERROR: Calculation halted. Periodic box dimensions have changed too
much from their initial values.
This error and your error (3) do tend to suggest that the system is far from equilibrium; I know in your e-mail you say that you have run the systems for 50 ns to reach equilibration.
Do you get error (3), "ERROR: max pairlist cutoff must be less than unit cell max sphere radius!", right at the start of a run or during the simulation? If it's at the start we have definitely seen this, but never during a run. If it is the former then again this will be related to (1) and (2), but if this occurs during the run then something really odd is happening.
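(For context, my understanding of that check, assuming an orthorhombic cell, so treat the numbers as illustrative: the pairlist cutoff is cut plus the skinnb buffer, 2 A by default, and it must stay below the radius of the largest sphere that fits inside the unit cell, i.e. half the shortest box edge:

    cut + skinnb < (shortest box edge) / 2

So with cut=8.0 every edge of the box needs to stay above roughly 20 A, and a box that shrinks as the system drifts from equilibrium can eventually trip this mid-run.)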
(4) NaNs are always a bad sign that something has gone hideously wrong. Are the NaNs in the electrostatics, the vdW terms, or the bonded terms? If they are occurring in the bonded terms then this again looks to me like it is SHAKE related.
Potentially a big thing in your e-mail is that this is occurring only on restarts. A long time ago there was an issue with pmemd.cuda which would cause failures of this kind upon restarts (Scott and Ross, if you are reading this, could you pitch in here?), but that was resolved way back in Amber 12 to the best of my recollection.
Some thoughts on things to try:
1) Run the simulation for 100 ns in one go. From your e-mail you have cut down the length of the simulation; go in the opposite direction instead. If you still get the errors, that removes the possibility that it is a restart issue; if you can run multiple 100 ns simulations without the error, that would be a strong indication that the problem is related to the restarts and may be a lot easier to diagnose.
2) What timestep and SHAKE parameters are you using? If you are using HMR I'd immediately drop the timestep to 2 fs or even 1 fs to see if this is the issue (see the input sketch after this list).
3) What pressure coupling algorithm are you using? My group always use Berendsen when running these types of simulations; we have never tried them with the MC barostat. In principle it should not make any difference, but you never know.
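To make 1)-3) concrete, here is a minimal &cntrl sketch of the sort of conservative settings I have in mind. The specific values (timestep, barostat, thermostat, temperature, output frequencies) are illustrative assumptions on my part, not a prescription for your system, and your umbrella restraint input would of course go on top of this:

    100 ns in one go, conservative settings (illustrative values only)
     &cntrl
       imin=0, irest=1, ntx=5,        ! restart from coordinates and velocities
       nstlim=50000000, dt=0.002,     ! 50,000,000 steps x 2 fs = 100 ns in one run
       ntc=2, ntf=2,                  ! SHAKE on bonds involving hydrogen (no HMR at 2 fs)
       ntb=2, ntp=2, barostat=1,      ! anisotropic Berendsen pressure coupling for the membrane
       taup=2.0, cut=8.0,
       ntt=3, gamma_ln=1.0, temp0=310.0,
       ntpr=5000, ntwx=50000, ntwr=500000,
     /

If a couple of these 100 ns blocks run clean, that points the finger firmly at the restarts.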
HTH
Ian
Tyrell: I'm surprised you didn't come here sooner.
Roy: It's not an easy thing to meet your maker.
Tyrell: What could he do for you?
Roy: Can the maker repair what he makes?
Blade Runner
--
Professor Ian R Gould, FRSC.
Professor of Computational Chemical Biology
Department of Chemistry
Imperial College London
Exhibition Road
London
SW7 2AY
E-mail i.gould@imperial.ac.uk
http://www3.imperial.ac.uk/people/i.gould
Tel +44 (0)207 594 5809
On 28/04/2018, 19:55, "Baker, Joseph" <bakerj@tcnj.edu> wrote:
Hi all,
We are trying to run some umbrella simulations with a small molecule
restrained in the z-direction in a number of windows with the molecule
moving through a POPE membrane (lipid14) using Amber16 pmemd.cuda. We are
encountering a number of errors in some windows (not all) that include the
following:
(1) Reason: cudaMemcpy GpuBuffer::Download failed an illegal memory access
was encountered
(2) ERROR: Calculation halted. Periodic box dimensions have changed too
much from their initial values.
(3) ERROR: max pairlist cutoff must be less than unit cell max sphere
radius!
(4) And occasionally NaN showing up for various energy terms in the output
log file, in which case the system keeps running, but when we view the
trajectory, in several windows the system has completely "exploded".
The strange thing (to me) is that each window has already been run for 50
ns with no problems on the GPU (suggesting they are equilibrated), and when
looking at the systems it does not appear there are any large fluctuations
of box size at the point that failures are occurring. Also, windows that
fail do not look very different compared to windows that continue to run
okay in the second 50 ns (aside from the ones that "explode" with NaN
errors).
Our collaborator at another site has seen the same errors when running our
system, and has also seen the same errors for their own system of a
different small molecule moving through the POPE membrane. In their case,
they ran their first 50 ns of each window on the CPU (pmemd.MPI, no
failures), and then when they switched to GPUs they started to see the
failures in the second 50 ns.
I should also add that at our site we have spot-checked one of the failing
windows by continuing it on the CPU instead of the GPU for the 2nd 50 ns,
and that works fine as well. So it appears that problems arise in only some
windows and only when trying to run the second 50 ns of these simulations
on a GPU device.
We have tried a number of solutions (running shorter simulations to restart
more frequently to attempt to fix the periodic box type errors, turning off
the umbrella restraints to see if that was the problem, etc.), but have not
been able to resolve these issues, and are at a bit of a loss for what
might be going on in our case.
Any advice, suggestions for tests, etc. would be greatly appreciated to
track down what might be going on when trying to extend these systems on
the GPU! Thanks!
Kind regards,
Joe
------
Joseph Baker, PhD
Assistant Professor
Department of Chemistry
C101 Science Complex
The College of New Jersey
Ewing, NJ 08628
Phone: (609) 771-3173
Web: http://bakerj.pages.tcnj.edu/
_______________________________________________
AMBER mailing list
AMBER@ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Sun Apr 29 2018 - 02:30:02 PDT