Re: [AMBER] pmemd.cuda.MPI run stops (amber24)

From: David A Case via AMBER <amber.ambermd.org>
Date: Sat, 13 Dec 2025 09:23:29 -0700

On Fri, Dec 12, 2025, Dulal Mondal via AMBER wrote:

>I submit a REAF job using pmemd.cuda.MPI. But the error is
> Primary job terminated normally, but 1 process returned
>a non-zero exit code. Per user-direction, the job has been aborted.
>--------------------------------------------------------------------------
>--------------------------------------------------------------------------
>mpirun detected that one or more processes exited with non-zero status,
>thus causing
>the job to be terminated. The first process to do so was:
>
> Process name: [[41136,1],2]
> Exit code: 255
>--------------------------------------------------------------------------
>and
>*cudaMemcpyToSymbol: SetSim copy to cSim failed invalid device symbol*

This message, and the MPI one, just indicate that some error occurred, but
offer no real clues as to why.

Is there anything in the mdout file that looks suspicious? Does the code
work with the non-MPI version of pmemd.cuda? Is the fact that REAF is being
used relevant? (That is, do non-REAF jobs work OK?) Does that system work
OK with the CPU version of pmemd?

I think you will have to do some trial and error debugging to try to
localize the source of the problem.
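
One way to organize those checks is sketched below. It only builds the
command lines for running the same system through pmemd, pmemd.cuda and
pmemd.cuda.MPI, so the three mdout files can be compared. The input file
names (mdin, prmtop, inpcrd), the output tags and the process count are
placeholders I am assuming, not taken from the failing job. A real REAF
or multi-replica job may also need a different launch line. For the
REAF versus non-REAF comparison, rerun the same commands with an mdin
that has the REAF settings removed.

```python
import shlex

# Hypothetical input file names; substitute those from the failing job.
FILES = {"-i": "mdin", "-p": "prmtop", "-c": "inpcrd"}


def pmemd_command(executable, tag, nproc=None):
    """Build one command line; output files are tagged per check."""
    cmd = [executable, "-O"]
    for flag, path in FILES.items():
        cmd += [flag, path]
    cmd += ["-o", f"mdout.{tag}", "-r", f"restrt.{tag}"]
    if nproc is not None:
        # Assumes a plain mpirun launch, as in the reported error output.
        cmd = ["mpirun", "-np", str(nproc)] + cmd
    return cmd


def diagnostic_commands(nproc=2):
    """Return (label, command) pairs, from simplest to the failing setup."""
    return [
        ("cpu", pmemd_command("pmemd", "cpu")),
        ("gpu_serial", pmemd_command("pmemd.cuda", "gpu_serial")),
        ("gpu_mpi", pmemd_command("pmemd.cuda.MPI", "gpu_mpi", nproc)),
    ]


if __name__ == "__main__":
    # Print the commands for inspection; nothing is executed here.
    for label, cmd in diagnostic_commands():
        print(f"{label:12s}", shlex.join(cmd))
```

The first check that fails (or the first mdout that looks odd) narrows
down whether the problem lies in the system itself, the GPU build, or
the MPI layer.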

>
>But amber 24 installation using cuda 11.7 and openmpi version 4.1.2 is
>successfully completed.

Does this imply that you ran the test suite (e.g. 'make test.cuda.serial')
successfully?

...good luck...dac

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Sat Dec 13 2025 - 08:30:03 PST