[AMBER] vlimit=10 compromise for Amber 20 error: "an illegal memory access was encountered launching kernel kClearForces"?

From: Liao <liaojunzhuo.aliyun.com>
Date: Wed, 14 Oct 2020 02:29:23 +0800

Dear Amber Users,

I recently started using Amber 20 with the ff19SB force field and OPC water.

With pmemd.cuda, I’ve run into a problem: at seemingly random points, the run crashes with this error message: "Error: an illegal memory access was encountered launching kernel kClearForces"

After extensive debugging, various control runs, and reading the forum, here’s what I discovered:

1. If I set ntpr=1, I can see that the system simply blows up all of a sudden, with no prior hint in the kinetic energy or temperature. The blowup is only visible with ntpr this small. Please see the sudden blowup below:
 NSTEP = 15979 TIME(PS) = 3781.958 TEMP(K) = 321.98 PRESS = 0.0
 Etot = -166339.6957 EKtot = 34967.1914 EPtot = -201306.8871
 BOND = 1332.6273 ANGLE = 3654.6213 DIHED = 2344.1866
 UB = 0.0000 IMP = 0.0000 CMAP = 300.2194
 1-4 NB = 1602.3619 1-4 EEL = 15665.8723 VDWAALS = 23153.0389
 EELEC = -249359.8148 EHBOND = 0.0000 RESTRAINT = 0.0000
 EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 523407.2851 Density = 1.0341
 ------------------------------------------------------------------------------
 NSTEP = 15980 TIME(PS) = 3781.960 TEMP(K) = 12807.45 PRESS = 0.0
 Etot = 1200379.6146 EKtot = 1390876.8750 EPtot = -190497.2604
 BOND = 1331.5817 ANGLE = 3724.0317 DIHED = 2343.1848
 UB = 0.0000 IMP = 0.0000 CMAP = 300.3352
 1-4 NB = 4362.7163 1-4 EEL = 15659.0224 VDWAALS = 31195.5756
 EELEC = -249413.7082 EHBOND = 0.0000 RESTRAINT = 0.0000
 EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 523407.2851 Density = 1.0341
 ------------------------------------------------------------------------------
 NSTEP = 15981 TIME(PS) = 3781.962 TEMP(K) = NaN PRESS = 0.0
 Etot = NaN EKtot = NaN EPtot = **************
 BOND = 78051.6147 ANGLE = 5028.6654 DIHED = 2347.0270
 UB = 0.0000 IMP = 0.0000 CMAP = 298.6265
 1-4 NB = 1615.2760 1-4 EEL = 15623.6489 VDWAALS = **************
 EELEC = -249473.1008 EHBOND = 0.0000 RESTRAINT = 0.0000
 EKCMT = 0.0000 VIRIAL = 0.0000 VOLUME = 523407.2851 Density = 1.0341

2. Using the Langevin thermostat (gamma=2), a random seed makes it blow up at random times, while a fixed seed makes it crash at exactly the same place (though with ntpr smaller than 10 or so, the same seed gives slightly different temperatures while still crashing at similar times). This is what made debugging so hard at first: it might crash after 2 minutes or after 1 hour, until I used a fixed seed.

3. At 310K, a timestep of 0.001 ps seems less likely to crash than 0.002 ps, but it still does. Reducing the temperature to 273K doesn’t help.
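
For anyone who wants to find this kind of jump without reading the mdout by eye, here is a minimal Python sketch. It assumes the NSTEP and TEMP(K) fields sit on one line, as in the output pasted in item 1. The function names and the factor-of-5 threshold are my own illustrative choices, not anything provided by Amber.

```python
import math
import re

# Matches lines such as:
#  NSTEP = 15980 TIME(PS) = 3781.960 TEMP(K) = 12807.45 PRESS = 0.0
STEP_RE = re.compile(r"NSTEP\s*=\s*(\d+).*?TEMP\(K\)\s*=\s*(\S+)")


def to_float(token):
    """Convert an mdout field to float; overflow markers (****) become NaN."""
    try:
        return float(token)
    except ValueError:
        return math.nan


def parse_temps(text):
    """Return a list of (step, temperature) pairs found in mdout text."""
    return [(int(m.group(1)), to_float(m.group(2)))
            for m in STEP_RE.finditer(text)]


def first_blowup(records, factor=5.0):
    """Return the first step whose temperature is non-finite or exceeds
    `factor` times the previous step's temperature, or None if there is none.
    The threshold is an arbitrary assumption for illustration."""
    prev = None
    for step, temp in records:
        if not math.isfinite(temp):
            return step
        if prev is not None and temp > factor * prev:
            return step
        prev = temp
    return None
```

Applied to the three records shown in item 1, first_blowup returns 15980, the step where TEMP(K) jumps from 321.98 to 12807.45. Note that the averages and fluctuation summaries at the end of an mdout file also contain NSTEP lines, and this sketch does not filter them out.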

I eventually found 2 workarounds for the problem.
The 1st way is to switch back to ff14SB by regenerating the prmtop file (with nothing else changed but the random seed); the run then completes normally. But since I’ve got Amber 20, I would still like to use ff19SB. Running the MD with the Amber18 pmemd gives a similar error (Error: an illegal memory access was encountered launching kernel kNLSkinTest), so it doesn’t seem to be the software.
The 2nd way is to set vlimit=10. Even at vlimit=12, the energy values blow up (although the simulation kept going until it finished). The output file didn’t give any warning messages, though when I compare the vlimit=10 results with those from the crashed run, the energy values become slightly different starting near the crash point.
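
To make the settings discussed above concrete, here is a small Python sketch that writes out a &cntrl namelist for the debugging run. The values for temp0, dt, ntpr, and vlimit are the ones mentioned in this post. I am assuming that ntt=3 with gamma_ln is the way the Langevin thermostat (the "gamma=2" above) is specified. The seed value for ig and the step count for nstlim are hypothetical placeholders, and this is a fragment for illustration only, not a complete production input.

```python
def render_cntrl(settings, title="debug run"):
    """Render a dict of &cntrl variables as Amber mdin text."""
    body = ", ".join(f"{key}={value}" for key, value in settings.items())
    return f"{title}\n &cntrl\n   {body},\n /\n"


debug_settings = {
    "ntt": 3,          # Langevin thermostat (assumed to match the post)
    "gamma_ln": 2.0,   # collision frequency, written as gamma=2 in the post
    "temp0": 310.0,    # target temperature in K, from the post
    "dt": 0.002,       # timestep in ps, from the post
    "ntpr": 1,         # print energies every step to catch the blowup
    "ig": 12345,       # hypothetical fixed seed so the crash is reproducible
    "vlimit": 10.0,    # velocity cap under discussion (default is 20)
    "nstlim": 50000,   # hypothetical number of steps
}

if __name__ == "__main__":
    print(render_cntrl(debug_settings))
```

Keeping the settings in one dict makes it easy to change a single variable (for example vlimit or ig) between control runs and see which change moves the crash.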

My concern is whether vlimit=10 is a good enough long-term solution, given that the default is 20. And why is this happening? It’s a typical protein-ligand system, and I have tried several ligands with the same problem.

Thanks for reading!
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Oct 13 2020 - 11:30:02 PDT