Re: [AMBER] vlimit=10 compromise for Amber 20 error: "an illegal memory access was encountered launching kernel kClearForces"? from David Cerutti on 2020-12-09 (Amber Archive Dec 2020)

From: David Cerutti <dscerutti.gmail.com>
Date: Wed, 9 Dec 2020 03:57:45 -0500

I have just found a bug in the CMAP setup for bond work units that would
appear in some systems and cause random crashes after 35k, 55k, or even
100k steps. The solution is to go into
${AMBERHOME}/src/pmemd/src/cuda/bondRemap.cpp, and at or about line 395
change the following code to read:

  if (objcode == ANGL_CODE || objcode == NMR3_CODE || objcode == CMAP_CODE) {
    if ((objcode == ANGL_CODE && ispresent[2] == 0) ||
        (objcode == CMAP_CODE && ispresent[4] == 0)) {
      bw->atomList[nunitatom] = p1atm[objID];
      nunitatom++;
    }
    if (unitMapCounts == NULL) {
      itmp = unitMap->data;
      itmp[p1atm[objID]] = unitID;
    }
    else {
      unitMap->map[p1atm[objID]][unitMapCounts->data[p1atm[objID]]] =
        unitID;
      unitMapCounts->data[p1atm[objID]] += 1;
    }
  }

The crucial detail is that "if (ispresent[2] == 0)" becomes two paired
conditions, each combining the objcode with the entry of the array that
records whether the object's atom is already present in the bond work
unit's import group. I will be patching the code ASAP, but it would be
ideal if you could make this change, recompile, and let us know whether
it solves your problem. (Please also get the patched version when it is
available: it includes a __syncwarp() that may or may not be important,
which I added while I was treating this as a pure race condition.) I think
the bug in the C++ layer causes some bad tables to be set up for the GPU,
which then leads to a race condition that causes what you are probably
seeing. The rest of the code is probably fine, and the __syncwarp() will
not affect performance and makes me feel better. So make the above change,
see how that goes, and then download the patch when we have it.
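
As a minimal sketch of what the changed condition does, the two tests can
be isolated from the rest of bondRemap.cpp and compared side by side. The
enum values, the function names oldNeedsImport and newNeedsImport, and the
five-element array are hypothetical stand-ins for illustration only; the
sketch also assumes that ispresent[k] flags the atom at position k of the
object, which the message implies but does not spell out.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-ins for the object codes used in bondRemap.cpp;
// the real definitions are not shown in the message.
enum ObjCode { ANGL_CODE, NMR3_CODE, CMAP_CODE };

// Assumption: one presence flag per atom position of the object.
using Presence = std::array<int, 5>;

// Old test, as described in the message: only ispresent[2] is consulted,
// whatever kind of object is being processed.
bool oldNeedsImport(const Presence& ispresent) {
  return ispresent[2] == 0;
}

// New test: the index that is consulted depends on the object code.
// NMR3_CODE objects never satisfy it, mirroring the snippet above.
bool newNeedsImport(ObjCode objcode, const Presence& ispresent) {
  return (objcode == ANGL_CODE && ispresent[2] == 0) ||
         (objcode == CMAP_CODE && ispresent[4] == 0);
}
```

For a CMAP object with flags {1, 1, 1, 1, 0}, the old test reports that no
import is needed while the new test reports that one is. With flags
{1, 1, 0, 1, 1} the old test requests an import that the new test skips.
Whether either mismatch is what produces the bad tables is the hypothesis
stated above, not something this sketch demonstrates.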

Dave

On Sun, Nov 22, 2020 at 9:21 PM David A Case <david.case.rutgers.edu> wrote:

> On Mon, Nov 23, 2020, Liao wrote:
> >
> >The starting structures are docked ligands in a crystal structure to
> >which I had manually added back the missing amino acid residues
>
> I don't have an answer, but here are some ideas/questions:
>
> 1. How did you parameterize the ligand? If with GAFF, did you use
> version 1 or version 2 of gaff? The reason is that GAFF1 has zero LJ
> terms on some hydrogens, whereas GAFF2 avoids this.
>
> Have you ever encountered what looks like the same problem with just
> protein + water (+ions), but no ligand? (I'm just trying to narrow down
> the problem, and enable as simple as possible a test case to look at.)
>
> 2. you might try the "lmod" action in parmed to create a revised prmtop
> file -- this removes all zero LJ terms, wherever they might come from.
>
> 3. Since GPU runs should be deterministic, can you look at the structure
> at exactly step 15979? Does the "check" action in cpptraj offer any
> clues for that structure?
>
> If it doesn't crash at exactly the same step every time, that's also a
> bit odd -- would be good to know, one way or another. Also worth
> knowing: does this happen with different GPU cards? I understand that
> it is odd that only ff19SB has this problem, but it's possible to have
> either a hardware or software bug that comes into play when CMAP is
> turned on.
>
> >
> > Now, working on a new protein-ligand system, I started out with ff14SB,
> > which runs normally as expected (implicit water, HMR prmtop files used). When
> > I decided to try ff19SB also, the simulation blows up again quite
> > quickly.
>
> Can you say a bit more here: what does "quite quickly" mean? What would
> be ideal would be to have two prmtop files (one for ff14SB, one for
> ff19SB), a common (restart) input file, an mdin file, precise information
> on what GPU was being used, and information about what step to expect the
> odd behavior at. For this exercise, don't use ig=-1, but choose a fixed
> random seed so that others can try to reproduce the problem. (Apologies
> if I am mis-remembering symptoms you have reported before.)
>
> ...thx...dac
>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>
Received on Wed Dec 09 2020 - 01:00:04 PST