Thank you for the pointers; this was extremely helpful. The results are as follows:
1. Running pmemd.cuda_DPFP instead works correctly.
2. Modifying the lines at 1309 as suggested does not fix the problem (same error as before) and produces no printf output.
3. Modifying the analogous lines in ik_Build1264NBList near line 1020 works correctly and prints "gpu->blocks: 132".
I'm happy to patch our installation accordingly, but should we anticipate any problems in doing so? Is there a more robust solution?
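To make the question concrete, the kind of patch I have in mind would derive the block count rather than hardcode it. This is only a sketch on my part, not a vetted fix; it assumes `gpu->blocks` reflects the SM count (as your printf comment suggests) and borrows the variable names from your 1309 snippet, which may differ at the 1020 site:
```
// Unvetted sketch: keep the 128-thread blocks that worked here, but
// derive the block count from the work size and cap it at the SM
// count (gpu->blocks) instead of hardcoding 16.
unsigned threadsPerBlock = 128;
unsigned blocksToUse = min((nterms / threadsPerBlock) + 1, gpu->blocks);
```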
Best,
Craig Gross
On Fri, 2025-03-21 at 22:01 +0000, Patricio Barletta via AMBER wrote:
I was able to run your example on a GH200. Attaching the output:
https://www.dropbox.com/scl/fi/7ztge4orkmsnc992gyjhf/output.out?rlkey=0fn4gfbkw8i9alxsy35cnj1u4&st=ooqgezvf&dl=0
This is the exact line I used to compile amber on that platform:
```
cmake ../amber -DCMAKE_INSTALL_PREFIX=../install_lbsr_dev -DCOMPILER=GNU -DMPI=TRUE -DCUDA=TRUE -DINSTALL_TESTS=TRUE -DDOWNLOAD_MINICONDA=FALSE -DBUILD_PYTHON=ON -DNVIDIA_MATH_LIBS=${TACC_NVIDIA_MATH_LIB} -DBUILD_QUICK=OFF
```
Some additional ideas:
Next to the `pmemd.cuda` link, you'll find a `pmemd.cuda_DPFP` binary. Try running your example with that.
Another thing you could try, if you're willing, is to open the file `amber/src/pmemd/src/cuda/gti_cuda.cu`.
At line 1309 you'll find:
```
unsigned threadsPerBlock = 768;
unsigned factor = (PASCAL) ? 1 : 1;  // note: both branches are 1 here
unsigned blocksToUse = (isDPFP) ? gpu->blocks
                                : min((nterms / threadsPerBlock) + 1,
                                      gpu->blocks * factor);
```
Replace that with:
```
unsigned threadsPerBlock = 128;
unsigned blocksToUse = 16;
printf("gpu->blocks: %d\n", gpu->blocks); // Just out of curiosity. This should be equal to the number of SMs of the GH200
```
And see if that fixes it.
If any of these extra things fixes it, it may mean that the CUDA runtime is launching additional threads along with each kernel, or occupying extra (shared?) memory. Make sure Amber wasn't compiled with debug symbols, and that no other process is running on the same GPU while you're testing.
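If you want to probe that directly, something like the standalone check below reports the SM count and how many 768-thread blocks the runtime will keep resident per SM. The empty kernel is only a stand-in; the real kernel's register and shared-memory usage is what matters, so treat this as a sketch of the approach rather than a definitive test:
```
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the real pmemd kernel; its resource usage will differ.
__global__ void dummyKernel() {}

int main() {
  int device = 0, smCount = 0, maxBlocksPerSM = 0;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);

  // Ask the runtime how many 768-thread blocks of this kernel can be
  // resident on one SM (assuming 0 bytes of dynamic shared memory).
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM,
                                                dummyKernel, 768, 0);

  printf("SMs: %d, resident 768-thread blocks per SM: %d\n",
         smCount, maxBlocksPerSM);
  return 0;
}
```
If the occupancy query reports 0 for the real kernel at 768 threads per block, the launch configuration itself is the culprit.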
That's all I can think of.
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Mar 24 2025 - 15:00:02 PDT