[AMBER] Segmentation fault when running pmemd.cuda.MPI in Amber22 with plumed 2.8.2

From: Tien Phan via AMBER <amber.ambermd.org>
Date: Wed, 29 Mar 2023 16:38:13 -0500

Hello everyone,

I am running Parallel Tempering - Well Tempered Ensemble (PT-WTE)
simulations using Amber22 and PLUMED 2.8.2. This method combines parallel
tempering with metadynamics applied to the total potential energy of the
simulation box.
https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.104.190601
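
For readers unfamiliar with the method, the exchange test between two
replicas that each carry their own bias on the potential energy can be
sketched as below. This is only an illustration of the standard Metropolis
criterion for biased replica exchange, not the way Amber or PLUMED implement
it. The function names, the kcal/mol units, and the callable bias arguments
are all assumptions made for the example.

```python
import math

# Boltzmann constant in kcal/(mol K); the unit choice is an assumption.
KB = 0.0019872041


def exchange_log_ratio(temp_i, temp_j, u_i, u_j, bias_i, bias_j):
    """Log of the Metropolis ratio for swapping the configurations of
    replicas i and j.

    u_i, u_j       : potential energies of the two configurations
    bias_i, bias_j : hypothetical callables mapping a potential energy
                     to the bias energy of replica i or j
    """
    beta_i = 1.0 / (KB * temp_i)
    beta_j = 1.0 / (KB * temp_j)
    # Plain parallel tempering term.
    delta = (beta_i - beta_j) * (u_i - u_j)
    # Each bias is evaluated on both configurations.
    delta += beta_i * (bias_i(u_i) - bias_i(u_j))
    delta += beta_j * (bias_j(u_j) - bias_j(u_i))
    return delta


def acceptance_probability(temp_i, temp_j, u_i, u_j, bias_i, bias_j):
    """min(1, exp(delta)), written so that exp() cannot overflow."""
    delta = exchange_log_ratio(temp_i, temp_j, u_i, u_j, bias_i, bias_j)
    return math.exp(min(0.0, delta))


def no_bias(energy):
    """Zero bias; with this the test reduces to plain parallel tempering."""
    return 0.0
```

With no_bias passed for both replicas, the expression reduces to the usual
parallel tempering acceptance test.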

I first tested the method on CPUs, and it worked. To improve performance,
I then ran the simulations on GPUs using pmemd.cuda.MPI, with 12 replicas
on 12 A40 GPUs. These simulations failed with a segmentation fault (shown
below). I also tried running Parallel Tempering (REMD), and those
simulations worked fine. The error occurs when plumed=1 is set.

I saw a post reporting the same issue in Amber20 when running umbrella
sampling REMD with PLUMED.
http://archive.ambermd.org/202007/0155.html

Do you happen to know what the issue could be? Does the problem come from
PLUMED or Amber? Is there a way to fix this?

Thank you,
Tien Phan

--------
 Running multipmemd version of pmemd Amber22
    Total processors = 12
    Number of groups = 12

+++ Loading the PLUMED kernel runtime +++
+++ PLUMED_KERNEL="/scratch/group/plumed-2.8.2/src/lib/libplumedKernel.so"
+++
+++ Loading the PLUMED kernel runtime +++
+++ PLUMED_KERNEL="/scratch/group/plumed-2.8.2/src/lib/libplumedKernel.so"
+++
+++ Loading the PLUMED kernel runtime +++
+++ PLUMED_KERNEL="/scratch/group/plumed-2.8.2/src/lib/libplumedKernel.so"
+++
+++ Loading the PLUMED kernel runtime +++
+++ PLUMED_KERNEL="/scratch/group/plumed-2.8.2/src/lib/libplumedKernel.so"
+++
+++ Loading the PLUMED kernel runtime +++
+++ PLUMED_KERNEL="/scratch/group/plumed-2.8.2/src/lib/libplumedKernel.so"
+++
+++ Loading the PLUMED kernel runtime +++
+++ PLUMED_KERNEL="/scratch/group/plumed-2.8.2/src/lib/libplumedKernel.so"
+++
+++ Loading the PLUMED kernel runtime +++
+++ PLUMED_KERNEL="/scratch/group/plumed-2.8.2/src/lib/libplumedKernel.so"
+++
+++ Loading the PLUMED kernel runtime +++
+++ PLUMED_KERNEL="/scratch/group/plumed-2.8.2/src/lib/libplumedKernel.so"
+++
+++ Loading the PLUMED kernel runtime +++
+++ PLUMED_KERNEL="/scratch/group/plumed-2.8.2/src/lib/libplumedKernel.so"
+++
+++ Loading the PLUMED kernel runtime +++
+++ PLUMED_KERNEL="/scratch/group/plumed-2.8.2/src/lib/libplumedKernel.so"
+++
+++ Loading the PLUMED kernel runtime +++
+++ PLUMED_KERNEL="/scratch/group/plumed-2.8.2/src/lib/libplumedKernel.so"
+++
+++ Loading the PLUMED kernel runtime +++
+++ PLUMED_KERNEL="/scratch/group/plumed-2.8.2/src/lib/libplumedKernel.so"
+++
[g104:116353:0:116353] Caught signal 11 (Segmentation fault: address not
mapped to object at address 0x30)
[g104:116354:0:116354] Caught signal 11 (Segmentation fault: address not
mapped to object at address 0x30)
[g104:116355:0:116355] Caught signal 11 (Segmentation fault: address not
mapped to object at address 0x30)
[g107:223076:0:223076] Caught signal 11 (Segmentation fault: address not
mapped to object at address 0x30)
[g108:249023:0:249023] Caught signal 11 (Segmentation fault: address not
mapped to object at address 0x30)
[g105:9061 :0:9061] Caught signal 11 (Segmentation fault: address not
mapped to object at address 0x30)
[g105:9059 :0:9059] Caught signal 11 (Segmentation fault: address not
mapped to object at address 0x30)
[g107:223075:0:223075] Caught signal 11 (Segmentation fault: address not
mapped to object at address 0x30)
[g108:249022:0:249022] Caught signal 11 (Segmentation fault: address not
mapped to object at address 0x30)
[g105:9060 :0:9060] Caught signal 11 (Segmentation fault: address not
mapped to object at address 0x30)
[g107:223077:0:223077] Caught signal 11 (Segmentation fault: address not
mapped to object at address 0x30)
[g108:249021:0:249021] Caught signal 11 (Segmentation fault: address not
mapped to object at address 0x30)
==== backtrace (tid: 9061) ====
 0 0x0000000000064e94 MPI_Allreduce() ???:0
 1 0x0000000000457d00 PLMD::GREX::cmd() ???:0
 2 0x000000000046c7ec PLMD::PlumedMain::cmd() ???:0
 3 0x000000000047011b plumed_plumedmain_cmd() ???:0
 4 0x0000000000711b1b plumed_cmd() ???:0
 5 0x0000000000711f46 plumed_gcmd() ???:0
 6 0x0000000000712e99 plumed_f_gcmd_static() Plumed.c:0
 7 0x0000000000712ddb plumed_f_gcmd_() ???:0
 8 0x000000000059c709 __runmd_mod_MOD_runmd() ???:0
 9 0x00000000005e29f3 MAIN__() pmemd.F90:0
==== backtrace (tid: 9060) ====
 0 0x0000000000064e94 MPI_Allreduce() ???:0
 1 0x0000000000457d00 PLMD::GREX::cmd() ???:0
 2 0x000000000046c7ec PLMD::PlumedMain::cmd() ???:0
 3 0x000000000047011b plumed_plumedmain_cmd() ???:0
 4 0x0000000000711b1b plumed_cmd() ???:0
 5 0x0000000000711f46 plumed_gcmd() ???:0
 6 0x0000000000712e99 plumed_f_gcmd_static() Plumed.c:0
 7 0x0000000000712ddb plumed_f_gcmd_() ???:0
 8 0x000000000059c709 __runmd_mod_MOD_runmd() ???:0
 9 0x00000000005e29f3 MAIN__() pmemd.F90:0
10 0x00000000004d847d main() ???:0
11 0x0000000000022555 __libc_start_main() ???:0
12 0x00000000004f17be _start() ???:0
=================================
10 0x00000000004d847d main() ???:0

Program received signal SIGSEGV: Segmentation fault - invalid memory
reference.

Backtrace for this error:
11 0x0000000000022555 __libc_start_main() ???:0
12 0x00000000004f17be _start() ???:0
=================================

Program received signal SIGSEGV: Segmentation fault - invalid memory
reference.

Backtrace for this error:
==== backtrace (tid: 249023) ====
==== backtrace (tid: 9059) ====
 0 0x0000000000064e94 MPI_Allreduce() ???:0
==== backtrace (tid: 249022) ====
 0 0x0000000000064e94 MPI_Allreduce() ???:0
 1 0x0000000000457d00 PLMD::GREX::cmd() ???:0
 2 0x000000000046c7ec PLMD::PlumedMain::cmd() ???:0
 3 0x000000000047011b plumed_plumedmain_cmd() ???:0
 4 0x0000000000711b1b plumed_cmd() ???:0
 5 0x0000000000711f46 plumed_gcmd() ???:0
 6 0x0000000000712e99 plumed_f_gcmd_static() Plumed.c:0
 7 0x0000000000712ddb plumed_f_gcmd_() ???:0
 8 0x000000000059c709 __runmd_mod_MOD_runmd() ???:0
 9 0x00000000005e29f3 MAIN__() pmemd.F90:0
10 0x00000000004d847d main() ???:0
11 0x0000000000022555 __libc_start_main() ???:0
12 0x00000000004f17be _start() ???:0
 0 0x0000000000064e94 MPI_Allreduce() ???:0
=================================
 1 0x0000000000457d00 PLMD::GREX::cmd() ???:0

Program received signal SIGSEGV: Segmentation fault - invalid memory
reference.

Backtrace for this error:
 1 0x0000000000457d00 PLMD::GREX::cmd() ???:0
 2 0x000000000046c7ec PLMD::PlumedMain::cmd() ???:0
 2 0x000000000046c7ec PLMD::PlumedMain::cmd() ???:0
 3 0x000000000047011b plumed_plumedmain_cmd() ???:0
 3 0x000000000047011b plumed_plumedmain_cmd() ???:0
 4 0x0000000000711b1b plumed_cmd() ???:0
 4 0x0000000000711b1b plumed_cmd() ???:0
 5 0x0000000000711f46 plumed_gcmd() ???:0
 5 0x0000000000711f46 plumed_gcmd() ???:0
 6 0x0000000000712e99 plumed_f_gcmd_static() Plumed.c:0
 6 0x0000000000712e99 plumed_f_gcmd_static() Plumed.c:0
 7 0x0000000000712ddb plumed_f_gcmd_() ???:0
 8 0x000000000059c709 __runmd_mod_MOD_runmd() ???:0
 7 0x0000000000712ddb plumed_f_gcmd_() ???:0
 8 0x000000000059c709 __runmd_mod_MOD_runmd() ???:0
 9 0x00000000005e29f3 MAIN__() pmemd.F90:0
==== backtrace (tid: 223077) ====
 9 0x00000000005e29f3 MAIN__() pmemd.F90:0
10 0x00000000004d847d main() ???:0
10 0x00000000004d847d main() ???:0
11 0x0000000000022555 __libc_start_main() ???:0
11 0x0000000000022555 __libc_start_main() ???:0
12 0x00000000004f17be _start() ???:0
12 0x00000000004f17be _start() ???:0
=================================
 0 0x0000000000064e94 MPI_Allreduce() ???:0
=================================

Program received signal SIGSEGV: Segmentation fault - invalid memory
reference.
 1 0x0000000000457d00 PLMD::GREX::cmd() ???:0

Program received signal SIGSEGV: Segmentation fault - invalid memory
reference.

Backtrace for this error:
 2 0x000000000046c7ec PLMD::PlumedMain::cmd() ???:0

Backtrace for this error:
 3 0x000000000047011b plumed_plumedmain_cmd() ???:0
 4 0x0000000000711b1b plumed_cmd() ???:0
==== backtrace (tid: 223076) ====
 0 0x0000000000064e94 MPI_Allreduce() ???:0
 1 0x0000000000457d00 PLMD::GREX::cmd() ???:0
 2 0x000000000046c7ec PLMD::PlumedMain::cmd() ???:0
 3 0x000000000047011b plumed_plumedmain_cmd() ???:0
 4 0x0000000000711b1b plumed_cmd() ???:0
 5 0x0000000000711f46 plumed_gcmd() ???:0
 6 0x0000000000712e99 plumed_f_gcmd_static() Plumed.c:0
 7 0x0000000000712ddb plumed_f_gcmd_() ???:0
 8 0x000000000059c709 __runmd_mod_MOD_runmd() ???:0
 9 0x00000000005e29f3 MAIN__() pmemd.F90:0
10 0x00000000004d847d main() ???:0
11 0x0000000000022555 __libc_start_main() ???:0
12 0x00000000004f17be _start() ???:0
=================================
==== backtrace (tid: 249021) ====

Program received signal SIGSEGV: Segmentation fault - invalid memory
reference.

Backtrace for this error:
 5 0x0000000000711f46 plumed_gcmd() ???:0
 6 0x0000000000712e99 plumed_f_gcmd_static() Plumed.c:0
 7 0x0000000000712ddb plumed_f_gcmd_() ???:0
 0 0x0000000000064e94 MPI_Allreduce() ???:0
 8 0x000000000059c709 __runmd_mod_MOD_runmd() ???:0
 1 0x0000000000457d00 PLMD::GREX::cmd() ???:0
 9 0x00000000005e29f3 MAIN__() pmemd.F90:0
10 0x00000000004d847d main() ???:0
 2 0x000000000046c7ec PLMD::PlumedMain::cmd() ???:0
11 0x0000000000022555 __libc_start_main() ???:0
 3 0x000000000047011b plumed_plumedmain_cmd() ???:0
12 0x00000000004f17be _start() ???:0
 4 0x0000000000711b1b plumed_cmd() ???:0
=================================
 5 0x0000000000711f46 plumed_gcmd() ???:0

Program received signal SIGSEGV: Segmentation fault - invalid memory
reference.
 6 0x0000000000712e99 plumed_f_gcmd_static() Plumed.c:0

Backtrace for this error:
 7 0x0000000000712ddb plumed_f_gcmd_() ???:0
==== backtrace (tid: 223075) ====
 8 0x000000000059c709 __runmd_mod_MOD_runmd() ???:0
 0 0x0000000000064e94 MPI_Allreduce() ???:0
 9 0x00000000005e29f3 MAIN__() pmemd.F90:0
 1 0x0000000000457d00 PLMD::GREX::cmd() ???:0
10 0x00000000004d847d main() ???:0
 2 0x000000000046c7ec PLMD::PlumedMain::cmd() ???:0
11 0x0000000000022555 __libc_start_main() ???:0
 3 0x000000000047011b plumed_plumedmain_cmd() ???:0
12 0x00000000004f17be _start() ???:0
 4 0x0000000000711b1b plumed_cmd() ???:0
=================================
 5 0x0000000000711f46 plumed_gcmd() ???:0

Program received signal SIGSEGV: Segmentation fault - invalid memory
reference.
 6 0x0000000000712e99 plumed_f_gcmd_static() Plumed.c:0

Backtrace for this error:
 7 0x0000000000712ddb plumed_f_gcmd_() ???:0
 8 0x000000000059c709 __runmd_mod_MOD_runmd() ???:0
 9 0x00000000005e29f3 MAIN__() pmemd.F90:0
10 0x00000000004d847d main() ???:0
11 0x0000000000022555 __libc_start_main() ???:0
12 0x00000000004f17be _start() ???:0
=================================

Program received signal SIGSEGV: Segmentation fault - invalid memory
reference.

Backtrace for this error:
==== backtrace (tid: 116354) ====
 0 0x0000000000064e94 MPI_Allreduce() ???:0
 1 0x0000000000457d00 PLMD::GREX::cmd() ???:0
 2 0x000000000046c7ec PLMD::PlumedMain::cmd() ???:0
 3 0x000000000047011b plumed_plumedmain_cmd() ???:0
 4 0x0000000000711b1b plumed_cmd() ???:0
 5 0x0000000000711f46 plumed_gcmd() ???:0
 6 0x0000000000712e99 plumed_f_gcmd_static() Plumed.c:0
 7 0x0000000000712ddb plumed_f_gcmd_() ???:0
 8 0x000000000059c709 __runmd_mod_MOD_runmd() ???:0
 9 0x00000000005e29f3 MAIN__() pmemd.F90:0
10 0x00000000004d847d main() ???:0
11 0x0000000000022555 __libc_start_main() ???:0
12 0x00000000004f17be _start() ???:0
=================================

Program received signal SIGSEGV: Segmentation fault - invalid memory
reference.

Backtrace for this error:
==== backtrace (tid: 116355) ====
 0 0x0000000000064e94 MPI_Allreduce() ???:0
 1 0x0000000000457d00 PLMD::GREX::cmd() ???:0
 2 0x000000000046c7ec PLMD::PlumedMain::cmd() ???:0
 3 0x000000000047011b plumed_plumedmain_cmd() ???:0
 4 0x0000000000711b1b plumed_cmd() ???:0
 5 0x0000000000711f46 plumed_gcmd() ???:0
 6 0x0000000000712e99 plumed_f_gcmd_static() Plumed.c:0
 7 0x0000000000712ddb plumed_f_gcmd_() ???:0
 8 0x000000000059c709 __runmd_mod_MOD_runmd() ???:0
 9 0x00000000005e29f3 MAIN__() pmemd.F90:0
10 0x00000000004d847d main() ???:0
11 0x0000000000022555 __libc_start_main() ???:0
12 0x00000000004f17be _start() ???:0
=================================

Program received signal SIGSEGV: Segmentation fault - invalid memory
reference.

Backtrace for this error:
==== backtrace (tid: 116353) ====
 0 0x0000000000064e94 MPI_Allreduce() ???:0
 1 0x0000000000457d00 PLMD::GREX::cmd() ???:0
 2 0x000000000046c7ec PLMD::PlumedMain::cmd() ???:0
 3 0x000000000047011b plumed_plumedmain_cmd() ???:0
 4 0x0000000000711b1b plumed_cmd() ???:0
 5 0x0000000000711f46 plumed_gcmd() ???:0
 6 0x0000000000712e99 plumed_f_gcmd_static() Plumed.c:0
 7 0x0000000000712ddb plumed_f_gcmd_() ???:0
 8 0x000000000059c709 __runmd_mod_MOD_runmd() ???:0
 9 0x00000000005e29f3 MAIN__() pmemd.F90:0
10 0x00000000004d847d main() ???:0
11 0x0000000000022555 __libc_start_main() ???:0
12 0x00000000004f17be _start() ???:0
=================================

Program received signal SIGSEGV: Segmentation fault - invalid memory
reference.

Backtrace for this error:
#0 0x2b13c50d562f in ???
#0 0x2b22ce0c262f in ???
#1 0x2b13c3d05e94 in ???
#2 0x2b1405d43cff in ???
#3 0x2b1405d587eb in ???
#4 0x2b1405d5c11a in ???
#5 0x711b1a in ???
#1 0x2b22cccf2e94 in ???
#6 0x711f45 in ???
#2 0x2b230ef5acff in ???
#7 0x712e98 in ???
#3 0x2b230ef6f7eb in ???
#4 0x2b230ef7311a in ???
#8 0x712dda in ???
#0 0x2ba34bf3f62f in ???
#9 0x59c708 in ???
#5 0x711b1a in ???
#6 0x711f45 in ???
#1 0x2ba34ab6fe94 in ???
#7 0x712e98 in ???
#10 0x5e29f2 in ???
#8 0x712dda in ???
#9 0x59c708 in ???
#10 0x5e29f2 in ???
#11 0x4d847c in ???
#11 0x4d847c in ???
#12 0x2b13c5a26554 in ???
#2 0x2ba38d769cff in ???
#3 0x2ba38d77e7eb in ???
#4 0x2ba38d78211a in ???
#12 0x2b22cea13554 in ???
#13 0x4f17bd in ???
#5 0x711b1a in ???
#6 0x711f45 in ???
#13 0x4f17bd in ???
#14 0xffffffffffffffff in ???
#14 0xffffffffffffffff in ???
#7 0x712e98 in ???
#0 0x2af14cfc062f in ???
#1 0x2af14bbf0e94 in ???
#8 0x712dda in ???
#9 0x59c708 in ???
#10 0x5e29f2 in ???
#11 0x4d847c in ???
#12 0x2ba34c890554 in ???
#0 0x2aec0ffca62f in ???
#2 0x2af18da41cff in ???
#3 0x2af18da567eb in ???
#4 0x2af18da5a11a in ???
#1 0x2aec0ebfae94 in ???
#2 0x2aec44ab9cff in ???
#3 0x2aec44ace7eb in ???
#4 0x2aec44ad211a in ???
#5 0x711b1a in ???
#5 0x711b1a in ???
#6 0x711f45 in ???
#13 0x4f17bd in ???
#14 0xffffffffffffffff in ???
#7 0x712e98 in ???
#6 0x711f45 in ???
#8 0x712dda in ???
#7 0x712e98 in ???
#9 0x59c708 in ???
#8 0x712dda in ???
#10 0x5e29f2 in ???
#9 0x59c708 in ???
#11 0x4d847c in ???
#10 0x5e29f2 in ???
#12 0x2af14d911554 in ???
#11 0x4d847c in ???
#13 0x4f17bd in ???
#14 0xffffffffffffffff in ???
#12 0x2aec1091b554 in ???
#0 0x2b1671ed362f in ???
#13 0x4f17bd in ???
#1 0x2b1670b03e94 in ???
#2 0x2b16b2d45cff in ???
#0 0x2add2e28462f in ???
#14 0xffffffffffffffff in ???
#3 0x2b16b2d5a7eb in ???
#1 0x2add2ceb4e94 in ???
#0 0x2b0a90a0c62f in ???
#1 0x2b0a8f63ce94 in ???
#2 0x2b0ac5a58cff in ???
#3 0x2b0ac5a6d7eb in ???
#4 0x2b0ac5a7111a in ???
#2 0x2add6f106cff in ???
#4 0x2b16b2d5e11a in ???
#3 0x2add6f11b7eb in ???
#5 0x711b1a in ???
#6 0x711f45 in ???
#4 0x2add6f11f11a in ???
#5 0x711b1a in ???
#5 0x711b1a in ???
#6 0x711f45 in ???
#6 0x711f45 in ???
#0 0x2abfe2a0e62f in ???
#7 0x712e98 in ???
#7 0x712e98 in ???
#7 0x712e98 in ???
#8 0x712dda in ???
#8 0x712dda in ???
#9 0x59c708 in ???
#8 0x712dda in ???
#10 0x5e29f2 in ???
#9 0x59c708 in ???
#10 0x5e29f2 in ???
#9 0x59c708 in ???
#11 0x4d847c in ???
#12 0x2b1672824554 in ???
#10 0x5e29f2 in ???
#11 0x4d847c in ???
#12 0x2b0a9135d554 in ???
#1 0x2abfe163ee94 in ???
#11 0x4d847c in ???
#13 0x4f17bd in ???
#14 0xffffffffffffffff in ???
#12 0x2add2ebd5554 in ???
#13 0x4f17bd in ???
#14 0xffffffffffffffff in ???
#2 0x2ac023496cff in ???
#13 0x4f17bd in ???
#3 0x2ac0234ab7eb in ???
#4 0x2ac0234af11a in ???
#14 0xffffffffffffffff in ???
#5 0x711b1a in ???
#6 0x711f45 in ???
#7 0x712e98 in ???
#8 0x712dda in ???
#9 0x59c708 in ???
#10 0x5e29f2 in ???
#11 0x4d847c in ???
#12 0x2abfe335f554 in ???
#13 0x4f17bd in ???
#14 0xffffffffffffffff in ???
#0 0x2b8b85e2c62f in ???
#1 0x2b8b84a5ce94 in ???
#2 0x2b8bc7542cff in ???
#3 0x2b8bc75577eb in ???
#0 0x2ae4b38af62f in ???
#1 0x2ae4b24dfe94 in ???
#2 0x2ae4f5538cff in ???
#3 0x2ae4f554d7eb in ???
#4 0x2ae4f555111a in ???
#4 0x2b8bc755b11a in ???
#5 0x711b1a in ???
#5 0x711b1a in ???
#6 0x711f45 in ???
#6 0x711f45 in ???
#7 0x712e98 in ???
#7 0x712e98 in ???
#8 0x712dda in ???
#8 0x712dda in ???
#9 0x59c708 in ???
#10 0x5e29f2 in ???
#9 0x59c708 in ???
#10 0x5e29f2 in ???
#11 0x4d847c in ???
#12 0x2b8b8677d554 in ???
#11 0x4d847c in ???
#12 0x2ae4b4200554 in ???
#13 0x4f17bd in ???
#14 0xffffffffffffffff in ???
#13 0x4f17bd in ???
#14 0xffffffffffffffff in ???
#0 0x2aeaee0b362f in ???
#1 0x2aeaecce3e94 in ???
#2 0x2aeb2f41ccff in ???
#3 0x2aeb2f4317eb in ???
#4 0x2aeb2f43511a in ???
#5 0x711b1a in ???
#6 0x711f45 in ???
#7 0x712e98 in ???
#8 0x712dda in ???
#9 0x59c708 in ???
#10 0x5e29f2 in ???
#11 0x4d847c in ???
#12 0x2aeaeea04554 in ???
#13 0x4f17bd in ???
#14 0xffffffffffffffff in ???
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node g104 exited on signal
11 (Segmentation fault).
--------------------------------------------------------------------------
3 total processes killed (some possibly by mpirun during cleanup)
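
Because the 12 ranks write their backtraces at the same time, the output
above is interleaved and hard to read. A small script along the following
lines can tally how often each (frame index, function) pair appears, which
makes it easier to check whether every rank fails in the same call chain.
This is a minimal sketch: the regular expression is an assumption based on
the line format shown in this log, and the function name summarize_frames is
made up for the example.

```python
import re
from collections import Counter

# Matches lines such as " 1 0x0000000000457d00 PLMD::GREX::cmd() ???:0".
# Lines of the form "#0 0x2b13c50d562f in ???" are deliberately skipped
# because they carry no symbol names.
FRAME_RE = re.compile(r"^\s*(\d+)\s+0x[0-9a-fA-F]+\s+(\S+)\(\)")


def summarize_frames(log_text):
    """Return a sorted list of ((frame_index, function), count) pairs."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return sorted(counts.items())


if __name__ == "__main__":
    # Three lines copied from the log above, used as a tiny sample.
    sample = (
        " 0 0x0000000000064e94 MPI_Allreduce() ???:0\n"
        " 1 0x0000000000457d00 PLMD::GREX::cmd() ???:0\n"
        " 0 0x0000000000064e94 MPI_Allreduce() ???:0\n"
    )
    for (index, function), count in summarize_frames(sample):
        print(index, function, count)
```

To use it on a real run, read the saved job output into a string and pass it
to summarize_frames.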
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Mar 29 2023 - 15:00:02 PDT