[AMBER] segmentation faults and randomly stopping simulations from Ryan Woltz via AMBER on 2024-10-10 (Amber Archive Oct 2024)

From: Ryan Woltz via AMBER <amber.ambermd.org>
Date: Thu, 10 Oct 2024 15:16:26 -0700

Dear Community,

      I hope this finds you well. My issue is a follow-up to the post below,
with a few additional steps.

http://archive.ambermd.org/202106/0032.html

I am getting this error during minimization:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x7cfb6b623960 in ???
#1 0x7cfb6b622ac5 in ???
#2 0x7cfb6b24251f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#3 0x55e1d96b6236 in ???
#4 0x55e1d96cb224 in ???
#5 0x55e1d96fff32 in ???
#6 0x55e1d9730242 in ???
#7 0x55e1d95fb0be in ???
#8 0x7cfb6b229d8f in __libc_start_call_main
at ../sysdeps/nptl/libc_start_call_main.h:58
#9 0x7cfb6b229e3f in __libc_start_main_impl
at ../csu/libc-start.c:392
#10 0x55e1d95fb0f4 in ???
#11 0xffffffffffffffff in ???
./run_simulation_PMEMD_local.sh: line 76: 2140490 Segmentation fault (core dumped) ${amberCPU} -O -i ${mini_prefix}.mdin -p ${init}.parm7 -c ${init}.rst7 -o ${mini_prefix}.mdout -r ${mini_prefix}.rst7 -inf ${mini_prefix}.mdinfo -ref ${init}.rst7

I've received this error with AMBER 24 and AMBER 22 on both a local computer
and a cluster, with no issues in other small simulations. I have received
this error very early (NSTEP 100) in a 1.7M-atom system and in a smaller
800K-atom system. These systems are a dimer and a pentamer of a single ion
channel that has run very well for me for the last 4 years (300K atoms for
the single channel). I checked the mdout file and the final section is
posted here:

   NSTEP       ENERGY          RMS            GMAX         NAME    NUMBER
    3300      -2.4631E+06     4.5192E-01     2.2351E+02     C210   202943

 BOND    =    89542.2164  ANGLE   =    76808.1827  DIHED      =   109318.4337
 UB      =    12522.2692  IMP     =      916.5317  CMAP       =    -1846.3319
 VDWAALS =   225907.9402  EEL     = -3020374.2635  HBOND      =        0.0000
 1-4 VDW =    21258.8594  1-4 EEL =    18791.0156  RESTRAINT  =     4098.4033
 EAMBER  = -2467155.1464

As in that post, I checked the BOND energy, which is in the 90K range. The
single channel is in the 40K range at the same step, so this number makes
sense for the dimer, which has more than double the number of atoms. Every
time I run the system it fails at a different stage, so it seems like a
random event, except for the 1.7M-atom runs, which fail very quickly. I've
also run checkstructure with cpptraj (although I never use it, so I could've
made a mistake) and it seems to pass with no warnings. I'm monitoring my
RAM (128 GB installed) and usage doesn't seem to exceed that. Is there a
number in the table above that is excessively high? Do I just need better
hardware to handle the larger system? I'm using a fairly decent 13th-gen i9
with a 4090.
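
One way to look for an excessively high value is to pull every per-step
summary row (NSTEP, ENERGY, RMS, GMAX, NAME, NUMBER) out of the mdout file
and see how GMAX and the atom carrying it change before the crash. The
following is only a minimal sketch: it assumes the summary rows look like
the one quoted above, the function names are made up for illustration, and
the GMAX limit is a placeholder chosen by the user, not an AMBER
recommendation.

```python
import math


def parse_min_steps(mdout_text):
    """Collect minimization summary rows (NSTEP ENERGY RMS GMAX NAME NUMBER)."""
    rows = []
    for line in mdout_text.splitlines():
        tokens = line.split()
        # A summary row has six fields, starts with an integer step and ends
        # with an integer atom number; the header row fails the isdigit() test.
        if len(tokens) != 6:
            continue
        if not (tokens[0].isdigit() and tokens[5].isdigit()):
            continue
        try:
            energy, rms, gmax = (float(t) for t in tokens[1:4])
        except ValueError:
            # If any numeric field cannot be parsed, mark all three as NaN
            # so the row is still reported as suspect.
            energy = rms = gmax = math.nan
        rows.append({
            "step": int(tokens[0]),
            "energy": energy,
            "rms": rms,
            "gmax": gmax,
            "atom_name": tokens[4],
            "atom_number": int(tokens[5]),
        })
    return rows


def flag_suspect_steps(rows, gmax_limit):
    """Return rows whose GMAX is non-finite or above a user-chosen limit."""
    return [
        r for r in rows
        if not math.isfinite(r["gmax"]) or r["gmax"] > gmax_limit
    ]
```

Reading the file with open("step.mdout").read() (a hypothetical file name)
and passing the text to parse_min_steps gives one dictionary per reported
step. If the same atom number keeps appearing with a growing GMAX in the
rows before the segmentation fault, that region of the structure is a
reasonable place to start looking.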

While I'm here, I also have an issue with a single-channel simulation that
seems to be related. I run my simulations in 1 ns steps, with step6 being
equilibration and step7.X being production, where X is the number of ns
completed. I run this in a loop that completes 65 ns at a time. A handful
of steps end early in the mdout file. However, each one writes enough data
that its files can be used to start the next step. So I'll end up with 65
steps completed, with steps 3, 4, 17, 34, and 55 ending early. I created
this simulation by copying a successful simulation and deleting the data to
do a replicate run. The original ran for 250 ns total with no problems, and
I've never had this happen in the 3-4 years I've been running this channel.
Additionally, I've made multiple copies of this simulation and run them,
only for each to fail at various points, with the first failure occurring
at a random time before 20 ns. I've also had this happen on AMBER 22 and 24
and on various computers. It is so strange that a system would run nicely
and then fail on every other rerun of that system. I did do some poking
about and I think it has something to do with CUDA, but since I've used
various AMBER 22 versions, CUDA versions, and different computers, it might
be my system. I don't know how to troubleshoot this or what to change. I
found this error, but it isn't helpful to me:

of length = 42Failed an illegal memory access was encountered
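
To find which steps ended early without opening every mdout file by hand,
a short script can check each file for the text that a finished run writes
at the end. This is a rough sketch under stated assumptions: the marker
string "Total wall time" is assumed to appear only in the final timing
section of a completed run, and the file pattern "step7.*.mdout" is a
hypothetical naming scheme, so both should be adjusted to match the actual
output.

```python
from pathlib import Path


def run_completed(mdout_text, marker="Total wall time"):
    """Return True if the text contains the assumed end-of-run marker."""
    return marker in mdout_text


def find_incomplete_runs(directory, pattern="step7.*.mdout",
                         marker="Total wall time"):
    """List mdout file names in a directory that lack the end-of-run marker."""
    incomplete = []
    for path in sorted(Path(directory).glob(pattern)):
        # errors="replace" keeps the scan going if a truncated file
        # contains bytes that do not decode cleanly.
        text = path.read_text(errors="replace")
        if not run_completed(text, marker):
            incomplete.append(path.name)
    return incomplete
```

Calling find_incomplete_runs(".") in the run directory returns the names
of the steps that stopped before the timing section was written. Comparing
that list across replicate runs and machines would show whether the
failures cluster at particular steps or are scattered at random.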

Since these both deal with what appear to be memory violations, I thought
I'd clump both errors together. Thank you to anyone who can help,

Ryan
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Oct 10 2024 - 15:30:02 PDT