Dear Joe,
We have been routinely running umbrella sampling simulations on GPUs of the kind you describe and have not experienced these sorts of problems in a very long time.
To address the errors you have posted:
(1) Reason: cudaMemcpy GpuBuffer::Download failed an illegal memory access
was encountered
When the CUDA code bombs out with this type of error, we have found, at least in my group, that it is usually related to errors in SHAKE, though given the cryptic way the CUDA code exits it can be very difficult to ascribe a correct reason for the error.
(2) ERROR: Calculation halted. Periodic box dimensions have changed too
much from their initial values.
This error and your error (3) do tend to suggest that the system is far from equilibrium; I know in your e-mail you say that you have run the systems for 50 ns to reach equilibration.
Do you get error (3), "ERROR: max pairlist cutoff must be less than unit cell max sphere radius!", right at the start of a run or during the simulation? If it's at the start we have definitely seen this, but never during a run. If it is the former then again this will be related to (1) and (2), but if this occurs during the run then something really odd is happening.
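(For context, my understanding of that check, assuming an orthorhombic cell, so treat the numbers as illustrative: the pairlist cutoff is cut plus the skinnb buffer, 2 A by default, and it must stay below the radius of the largest sphere that fits inside the unit cell, i.e. half the shortest box edge:

    cut + skinnb < (shortest box edge) / 2

So with cut=8.0 every edge of the box needs to stay above roughly 20 A, and a box that shrinks as the system drifts from equilibrium can eventually trip this mid-run.)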
(4) NaNs are always a bad sign that something has gone hideously wrong. Are the NaNs in the electrostatics, the vdW terms, or the bonded terms? If they are occurring in the bonded terms then this again looks to me like it is SHAKE related.
Potentially a big thing in your e-mail is that this is occurring only on restarts. A long time ago there was an issue with pmemd.cuda which would cause failures of this kind upon restarts (Scott and Ross, if you are reading this, could you pitch in here?), but that was resolved way back in Amber 12 to the best of my recollection.
Some thoughts on things to try:
1) Run the simulation for 100 ns in one go. From your e-mail you have cut down the length of the simulation; go in the opposite direction instead. If you still get the errors, that removes the possibility that it is a restart issue; if you can run multiple 100 ns simulations without the error, that would be a strong indication that the problem is related to the restarts and may be a lot easier to diagnose.
2) What timestep and SHAKE parameters are you using? If you are using HMR I'd immediately drop the timestep to 2 fs or even 1 fs to see if this is the issue (see the input sketch after this list).
3) What pressure coupling algorithm are you using? My group always use Berendsen when running these types of simulations; we have never tried them with the MC barostat. In principle it should not make any difference, but you never know.
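To make 1)-3) concrete, here is a minimal &cntrl sketch of the sort of conservative settings I have in mind. The specific values (timestep, barostat, thermostat, temperature, output frequencies) are illustrative assumptions on my part, not a prescription for your system, and your umbrella restraint input would of course go on top of this:

    100 ns in one go, conservative settings (illustrative values only)
     &cntrl
       imin=0, irest=1, ntx=5,        ! restart from coordinates and velocities
       nstlim=50000000, dt=0.002,     ! 50,000,000 steps x 2 fs = 100 ns in one run
       ntc=2, ntf=2,                  ! SHAKE on bonds involving hydrogen (no HMR at 2 fs)
       ntb=2, ntp=2, barostat=1,      ! anisotropic Berendsen pressure coupling for the membrane
       taup=2.0, cut=8.0,
       ntt=3, gamma_ln=1.0, temp0=310.0,
       ntpr=5000, ntwx=50000, ntwr=500000,
     /

If a couple of these 100 ns blocks run clean, that points the finger firmly at the restarts.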
HTH
Ian
Tyrell: I'm surprised you didn't come here sooner.
Roy: It's not an easy thing to meet your maker.
Tyrell: What could he do for you?
Roy: Can the maker repair what he makes?
Blade Runner
--
Professor Ian R Gould, FRSC.
Professor of Computational Chemical Biology
Department of Chemistry
Imperial College London
Exhibition Road
London
SW7 2AY
E-mail i.gould@imperial.ac.uk
http://www3.imperial.ac.uk/people/i.gould
Tel +44 (0)207 594 5809
On 28/04/2018, 19:55, "Baker, Joseph" <bakerj@tcnj.edu> wrote:
Hi all,
We are trying to run some umbrella simulations with a small molecule
restrained in the z-direction in a number of windows with the molecule
moving through a POPE membrane (lipid14) using Amber16 pmemd.cuda. We are
encountering a number of errors in some windows (not all) that include the
following:
(1) Reason: cudaMemcpy GpuBuffer::Download failed an illegal memory access
was encountered
(2) ERROR: Calculation halted. Periodic box dimensions have changed too
much from their initial values.
(3) ERROR: max pairlist cutoff must be less than unit cell max sphere
radius!
(4) And occasionally NaN showing up for various energy terms in the output
log file, in which case the system keeps running, but when we view the
trajectory, in several windows the system has completely "exploded".
The strange thing (to me) is that each window has already been run for 50
ns with no problems on the GPU (suggesting they are equilibrated), and when
looking at the systems it does not appear there are any large fluctuations
of box size at the point that failures are occurring. Also, windows that
fail do not look very different compared to windows that continue to run
okay in the second 50 ns (aside from the ones that "explode" with NaN
errors).
Our collaborator at another site has seen the same errors when running our
system, and has also seen the same errors for their own system of a
different small molecule moving through the POPE membrane. In their case,
they ran their first 50 ns of each window on the CPU (pmemd.MPI, no
failures), and then when they switched to GPUs they started to see the
failures in the second 50 ns.
I should also add that at our site we have spot-checked one of the failing
windows by continuing it on the CPU instead of the GPU for the 2nd 50 ns,
and that works fine as well. So it appears that problems arise in only some
windows and only when trying to run the second 50 ns of these simulations
on a GPU device.
We have tried a number of solutions (running shorter simulations to restart
more frequently to attempt to fix the periodic box type errors, turning off
the umbrella restraints to see if that was the problem, etc.), but have not
been able to resolve these issues, and are at a bit of a loss for what
might be going on in our case.
Any advice, suggestions for tests, etc. would be greatly appreciated to
track down what might be going on when trying to extend these systems on
the GPU! Thanks!
Kind regards,
Joe
------
Joseph Baker, PhD
Assistant Professor
Department of Chemistry
C101 Science Complex
The College of New Jersey
Ewing, NJ 08628
Phone: (609) 771-3173
Web: http://bakerj.pages.tcnj.edu/
_______________________________________________
AMBER mailing list
AMBER@ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Sun Apr 29 2018 - 02:30:02 PDT