[AMBER] Simulation not running on more than one MPI process with 12-6-4 parameters

From: Markowska <amber.ambermd.org>
Date: Mon, 11 Mar 2024 08:10:17 +0100

Dear Amber users,

I have noticed a possible bug related to pmemd.cuda_SPFP.MPI. My system of
interest contains several Mg2+ ions, which I wanted to reparametrize using
the 12-6-4 parameters of Panteva et al. I followed example B of this
tutorial: https://ambermd.org/tutorials/advanced/tutorial20/12_6_4.php and
obtained new .inpcrd and .prmtop files.
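For reference, the topology modification amounted to a short ParmEd run along
these lines (a sketch from memory; the :MG mask, the TIP3P water model, and the
file names are placeholders, not my exact inputs):

cat > add_1264.in << EOF
add12_6_4 :MG watermodel TIP3P
outparm complex_1264.prmtop
EOF
parmed -p complex.prmtop -i add_1264.in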
As a control, I also ran a simulation of the same system without the
12-6-4 parameters. In my simulation workflow I use pmemd.cuda_SPFP.MPI
for heating, equilibration, and production, running on 4 GPUs.
The command I use for heating is the following:
mpirun -np 4 $AMBERHOME/bin/pmemd.cuda_SPFP.MPI -O -i heat.in -o heat.out -p $PARM -c $RST -ref $RST -r $NAME-rst.nc -x $NAME-trj.nc
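In case it is relevant, heat.in is an ordinary restrained heating input, roughly
along the lines sketched below (the numbers and the restraint mask are
illustrative placeholders, not my exact settings):

Restrained heating (illustrative sketch, not the actual input file)
 &cntrl
   imin=0, irest=0, ntx=1,
   nstlim=50000, dt=0.002,
   ntc=2, ntf=2, cut=9.0, ntb=1,
   ntt=3, gamma_ln=2.0, ig=-1,
   tempi=0.0, temp0=300.0,
   ntr=1, restraintmask=':1-100', restraint_wt=10.0,
   ntpr=1000, ntwx=1000, ntwr=10000, ioutfm=1,
 /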

While my control system runs without any problems, the system with the
12-6-4 parameters applied throws several errors (see below). The only
workaround I have found is to reduce the number of MPI processes to 1;
the job then runs fine, but of course much more slowly. What may be causing
this error, and is there a quick way to fix it?
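For completeness, the single-process invocation that does run is identical
apart from the process count:

mpirun -np 1 $AMBERHOME/bin/pmemd.cuda_SPFP.MPI -O -i heat.in -o heat.out -p $PARM -c $RST -ref $RST -r $NAME-rst.nc -x $NAME-trj.nc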

[lrdn2425:779134:0:779134] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
==== backtrace (tid: 779134) ====
 0 /lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x14bcd5a72edc]
 1 /lib64/libucs.so.0(+0x2b0bc) [0x14bcd5a730bc]
 2 /lib64/libucs.so.0(+0x2b28a) [0x14bcd5a7328a]
 3 /lib64/libpthread.so.0(+0x12cf0) [0x14bd16b58cf0]
 4 /leonardo/prod/spack/03/install/0.19/linux-rhel8-icelake/gcc-11.3.0/amber-22-cz7v3y4nrcoxnjsgdwukvsexakhx2k5k/bin/pmemd.cuda_SPFP.MPI() [0x6bd59e]
 5 /leonardo/prod/spack/03/install/0.19/linux-rhel8-icelake/gcc-11.3.0/amber-22-cz7v3y4nrcoxnjsgdwukvsexakhx2k5k/bin/pmemd.cuda_SPFP.MPI() [0x4d6dfc]
 6 /leonardo/prod/spack/03/install/0.19/linux-rhel8-icelake/gcc-11.3.0/amber-22-cz7v3y4nrcoxnjsgdwukvsexakhx2k5k/bin/pmemd.cuda_SPFP.MPI() [0x4693f3]
 7 /leonardo/prod/spack/03/install/0.19/linux-rhel8-icelake/gcc-11.3.0/amber-22-cz7v3y4nrcoxnjsgdwukvsexakhx2k5k/bin/pmemd.cuda_SPFP.MPI() [0x46e381]
 8 /leonardo/prod/spack/03/install/0.19/linux-rhel8-icelake/gcc-11.3.0/amber-22-cz7v3y4nrcoxnjsgdwukvsexakhx2k5k/bin/pmemd.cuda_SPFP.MPI() [0x5554fe]
 9 /leonardo/prod/spack/03/install/0.19/linux-rhel8-icelake/gcc-11.3.0/amber-22-cz7v3y4nrcoxnjsgdwukvsexakhx2k5k/bin/pmemd.cuda_SPFP.MPI() [0x52ff07]
10 /leonardo/prod/spack/03/install/0.19/linux-rhel8-icelake/gcc-11.3.0/amber-22-cz7v3y4nrcoxnjsgdwukvsexakhx2k5k/bin/pmemd.cuda_SPFP.MPI() [0x41385d]
11 /lib64/libc.so.6(__libc_start_main+0xe5) [0x14bd15689d85]
12 /leonardo/prod/spack/03/install/0.19/linux-rhel8-icelake/gcc-11.3.0/amber-22-cz7v3y4nrcoxnjsgdwukvsexakhx2k5k/bin/pmemd.cuda_SPFP.MPI() [0x42cc3e]
=================================

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x14bd16b58cef in ???
#1 0x6bd59e in gti_lj1264_nb_setup_ at /dev/shm/propro01/spack-stage-amber-22-cz7v3y4nrcoxnjsgdwukvsexakhx2k5k/spack-src/src/pmemd/src/cuda/gti_f95.cpp:553
#2 0x4d6dfb in __extra_pnts_nb14_mod_MOD_nb14_setup at /dev/shm/propro01/spack-stage-amber-22-cz7v3y4nrcoxnjsgdwukvsexakhx2k5k/spack-src/src/pmemd/src/extra_pnts_nb14.F90:540
#3 0x4693f2 in do_atm_distribution at /dev/shm/propro01/spack-stage-amber-22-cz7v3y4nrcoxnjsgdwukvsexakhx2k5k/spack-src/src/pmemd/src/parallel.F90:2214
#4 0x46e380 in __parallel_mod_MOD_parallel_setup at /dev/shm/propro01/spack-stage-amber-22-cz7v3y4nrcoxnjsgdwukvsexakhx2k5k/spack-src/src/pmemd/src/parallel.F90:339
#5 0x5554fd in __pme_alltasks_setup_mod_MOD_pme_alltasks_setup at /dev/shm/propro01/spack-stage-amber-22-cz7v3y4nrcoxnjsgdwukvsexakhx2k5k/spack-src/src/pmemd/src/pme_alltasks_setup.F90:184
#6 0x52ff06 in pmemd at /dev/shm/propro01/spack-stage-amber-22-cz7v3y4nrcoxnjsgdwukvsexakhx2k5k/spack-src/src/pmemd/src/pmemd.F90:518
#7 0x41385c in main at /dev/shm/propro01/spack-stage-amber-22-cz7v3y4nrcoxnjsgdwukvsexakhx2k5k/spack-src/src/pmemd/src/pmemd.F90:77
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 779134 on node lrdn2425 exited on signal 11 (Segmentation fault).

Looking forward to hearing from you.
Best regards,
Karolina MitusiƄska