[AMBER] Help with MPI error: “MPI size is not a multiple of -ng” on ADA cluster (Amber24, T-REMD)

From: Rahul Singal via AMBER <amber.ambermd.org>
Date: Sat, 5 Jul 2025 11:44:56 +0000

Dear Amber users,
I am Rahul Singal, a fourth-year undergraduate student working on MD simulations with Amber24.
I am trying to run temperature replica exchange MD (T-REMD) simulations on the ADA HPC cluster, following the T-REMD tutorial exactly as described in the Amber24 manual.
I submit my job with the following SLURM script:
#!/bin/bash
#SBATCH --output=eqb_job.out
#SBATCH --error=eqb_job.err
#SBATCH --ntasks=4
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=4
#SBATCH --time=96:00:00
#SBATCH --partition=u22
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=rahul.singal.research.iiit.ac.in
#SBATCH --nodelist=gnode026

module load u22/amber

srun pmemd.cuda.MPI -ng 4 -groupfile groupfile -rem 1

However, when I run this, I get the following error:
Loading u22/amber/24
  Loading requirement: u22/openmpi/5.0.7-cuda-12.4 u22/cuda/12.4
setup_groups: MPI size is not a multiple of -ng
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
  Proc: [[25830,0],0]
  Errorcode: 1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
setup_groups: MPI size is not a multiple of -ng
setup_groups: MPI size is not a multiple of -ng
setup_groups: MPI size is not a multiple of -ng
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
  Proc: [[26204,0],0]
  Errorcode: 1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
  Proc: [[5676,0],0]
  Errorcode: 1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
  Proc: [[13201,0],0]
  Errorcode: 1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
srun: error: gnode026: tasks 0-2: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=2455491.0
slurmstepd: error: *** STEP 2455491.0 ON gnode026 CANCELLED AT 2025-07-05T16:26:21 ***
srun: error: gnode026: task 3: Exited with exit code 1

From the error, it seems the total MPI size is not a multiple of -ng 4, even though I request --ntasks=4 in the script.
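
To make my understanding concrete, below is the kind of launch line I would have expected to give an MPI size that is a multiple of -ng 4 (one rank per replica). This is only my guess based on general SLURM/Open MPI usage, not something taken from the tutorial, so please correct me if it is wrong:

# Hypothetical: explicitly request 4 MPI tasks so that the MPI size (4)
# is an exact multiple of -ng 4 (one rank per replica group)
srun -n 4 pmemd.cuda.MPI -ng 4 -groupfile groupfile -rem 1

# or, launching through Open MPI directly instead of srun (again, only a guess):
mpirun -np 4 pmemd.cuda.MPI -ng 4 -groupfile groupfile -rem 1
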
I would like to ask:

  1. Is there a recommended way to configure srun or mpirun under SLURM for T-REMD in Amber24?
  2. Should --ntasks always match the number of groups specified by -ng?
  3. Could this issue be related to how MPI is set up on the ADA cluster?

Any guidance or suggestions would be greatly appreciated; I am following the tutorial closely but still run into this problem.
Thank you very much for your help!
Best regards,
Rahul Singal

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Sat Jul 05 2025 - 05:00:03 PDT