Re: [AMBER] vlimit=10 compromise for Amber 20 error: "an illegal memory access was encountered launching kernel kClearForces"?

From: Liao <liaojunzhuo.aliyun.com>
Date: Wed, 16 Dec 2020 11:21:12 -0600

Hi David,

I’m happy to report that after applying the bug fix and reinstalling AMBER, the problem is gone. I finally managed to spare about half a day yesterday to do all of this, after our exchange last week.
With the vlimit=10 restriction removed, I’ve tried multiple 10 ns runs with the patched build, and all of them are OK. As a control, I ran the same system with the same starting frame and the same random seed on the old installation, and the problem occurred just as before.
Looks like I can really migrate to ff19SB now.

Thank you!



> On Dec 9, 2020, at 2:58 AM, David Cerutti <dscerutti.gmail.com> wrote:
>
> I have just found a bug in the CMAP setup for bond work units that would
> appear in some systems and cause random crashes after 35k, 55k, or even
> 100k steps. The solution is to go into
> ${AMBERHOME}/src/pmemd/src/cuda/bondRemap.cpp, and at or about line 395
> change the following code to read:
>
> if (objcode == ANGL_CODE || objcode == NMR3_CODE || objcode == CMAP_CODE)
> {
>   if ((objcode == ANGL_CODE && ispresent[2] == 0) ||
>       (objcode == CMAP_CODE && ispresent[4] == 0)) {
>     bw->atomList[nunitatom] = p1atm[objID];
>     nunitatom++;
>   }
>   if (unitMapCounts == NULL) {
>     itmp = unitMap->data;
>     itmp[p1atm[objID]] = unitID;
>   }
>   else {
>     unitMap->map[p1atm[objID]][unitMapCounts->data[p1atm[objID]]] = unitID;
>     unitMapCounts->data[p1atm[objID]] += 1;
>   }
> }
>
> The crucial detail is that "if (ispresent[2] == 0)" changes to two pairs of
> conditions involving the objcode and the array recording whether the
> object's atom is already present in the bond work unit's import group. I
> will be patching the code ASAP, but if you can make this change, recompile,
> and let us know whether it solves your problem, that would be ideal. (And
> please get the patched version when it is available; it also contains a
> __syncwarp() that may or may not be important, which I added while I was
> treating this as a pure race condition.) I think the bug in the C++ layer
> sets up bad tables for the GPU, which then leads to a race condition that
> causes what you are probably seeing. The rest of the code is probably fine,
> and the added __syncwarp() will not affect performance but makes me feel
> better, so make the above change, see how that goes, and then download the
> patch when we have it.
>
> Dave
>
>
>
>> On Sun, Nov 22, 2020 at 9:21 PM David A Case <david.case.rutgers.edu> wrote:
>>
>>> On Mon, Nov 23, 2020, Liao wrote:
>>>
>>> The starting structures are docked ligands in a crystal structure, into
>>> which I had manually added back the missing amino acid residues.
>>
>> I don't have an answer, but here are some ideas/questions:
>>
>> 1. How did you parameterize the ligand? If with GAFF, did you use
>> version 1 or version 2 of gaff? The reason is that GAFF1 has zero LJ
>> terms on some hydrogens, whereas GAFF2 avoids this.
>>
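>> If the ligand was done with GAFF1, re-deriving its parameters with GAFF2
>> atom types is a quick test. A minimal sketch (the file names and net
>> charge here are placeholders for your ligand) would be something like:
>>
>>   antechamber -i ligand.mol2 -fi mol2 -o ligand_gaff2.mol2 -fo mol2 \
>>               -c bcc -nc 0 -at gaff2
>>   parmchk2 -i ligand_gaff2.mol2 -f mol2 -o ligand_gaff2.frcmod -s gaff2
>>
>> and then "source leaprc.gaff2" in tleap when rebuilding the prmtop.
>>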
>> Have you ever encountered what looks like the same problem with just the
>> protein + water (+ions), but no ligand? (I'm just trying to narrow down
>> the problem and enable as simple a test case as possible to look at.)
>>
>> 2. You might try the "lmod" action in parmed to create a revised prmtop
>> file -- this removes all zero LJ terms, wherever they might come from.
>>
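>> A minimal sketch of that parmed step (assuming the topology is named
>> complex.prmtop; adjust the names to your system). With an input script
>> fix_lj.in containing:
>>
>>   lmod
>>   outparm complex_lmod.prmtop
>>
>> run:
>>
>>   parmed -p complex.prmtop -i fix_lj.in
>>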
>> 3. Since GPU runs should be deterministic, can you look at the structure
>> at exactly step 15979? Does the "check" action in cpptraj offer any
>> clues for that structure?
>>
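>> A minimal sketch of that check (the coordinate file name is a placeholder
>> for whatever frame or restart you save near step 15979). With check.in
>> containing:
>>
>>   trajin frame_15979.rst7
>>   check reportfile check_report.dat
>>
>> run:
>>
>>   cpptraj -p complex.prmtop -i check.in
>>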
>> If it doesn't crash at exactly the same step every time, that's also a
>> bit odd -- would be good to know, one way or another. Also worth
>> knowing: does this happen with different GPU cards? I understand that
>> it is odd that only ff19SB has this problem, but it's possible to have
>> either a hardware or software bug that comes into play when CMAP is
>> turned on.
>>
>>>
>>> Now, working on a new protein-ligand system, I started out with ff14SB,
>>> which runs normally as expected (implicit water, HMR prmtop files used).
>>> When I decided to also try ff19SB, the simulation again blows up quite
>>> quickly.
>>
>> Can you say a bit more here: what does "quite quickly" mean? What would
>> be ideal would be to have two prmtop files (one for ff14SB, one for
>> ff19SB), a common (restart) input file, an mdin file, precise information
>> on what GPU was being used, and information about what step to expect the
>> odd behavior at. For this exercise, don't use ig=-1, but choose a fixed
>> random seed so that others can try to reproduce the problem. (Apologies
>> if I am mis-remembering symptoms you have reported before.)
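>>
>> For example, something like the following mdin, where every value other
>> than ig is a placeholder to be taken from your existing input, and 71277
>> is just one fixed choice of seed:
>>
>>   reproducible ff19SB test
>>   &cntrl
>>     imin=0, irest=1, ntx=5,
>>     igb=8, cut=999.0,
>>     ntt=3, gamma_ln=1.0, temp0=300.0,
>>     dt=0.004, nstlim=100000,
>>     ntpr=1000, ntwx=1000,
>>     ig=71277,
>>   /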
>>
>> ...thx...dac
>>
>>


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Dec 16 2020 - 09:30:03 PST