Re: [AMBER] segmentation faults and randomly stopping simulations from Ryan Woltz via AMBER on 2024-10-10 (Amber Archive Oct 2024)

From: Ryan Woltz via AMBER <amber.ambermd.org>
Date: Thu, 10 Oct 2024 15:35:14 -0700

Thank you Masoud for such a fast reply. For the first issue I am using
pmemd. To run in serial do I need to build a specific pmemd.serial version?
Looking at htop it is only using a single thread.

This is the command I am running:

${amberCPU} -O -i ${mini_prefix}.mdin -p ${init}.parm7 -c ${init}.rst7 -o
${mini_prefix}.mdout -r ${mini_prefix}.rst7 -inf ${mini_prefix}.mdinfo -ref
${init}.rst7

Where amberCPU=pmemd
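
On the serial question: in a standard Amber install the serial CPU engine is the plain `pmemd` binary, with `pmemd.MPI` and `pmemd.cuda` as the parallel and GPU builds, so amberCPU=pmemd should already be a serial run, which would match the single thread seen in htop. A minimal sketch for making that explicit in a run script follows. `engine_kind` is a hypothetical helper, not an Amber tool, and it only looks at the binary's name under the usual naming assumption:

```shell
#!/usr/bin/env bash
# Hypothetical helper: guess the build type of an Amber engine from its name.
# Assumes the usual naming (pmemd, pmemd.MPI, pmemd.cuda, pmemd.cuda.MPI).
engine_kind() {
  local name
  name=$(basename "$1")
  case "$name" in
    *.cuda*) echo gpu ;;      # pmemd.cuda, pmemd.cuda_SPFP, pmemd.cuda.MPI
    *.MPI)   echo mpi ;;      # pmemd.MPI, sander.MPI
    *)       echo serial ;;   # pmemd, sander
  esac
}

amberCPU=pmemd
echo "${amberCPU} -> $(engine_kind "${amberCPU}")"
```

The name check says nothing about how the binary was compiled; it is only a guard against launching a different engine than intended.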

Thank you,

Ryan

On Thu, Oct 10, 2024 at 3:25 PM Masoud Keramati <keramati.m.northeastern.edu>
wrote:

> Hi Ryan,
>
> Have you tried to run AMBER in serial?
> It may take much longer time but could be helpful.
>
> Best,
>
> Masoud
> ------------------------------
> *From:* Ryan Woltz via AMBER <amber.ambermd.org>
> *Sent:* Thursday, October 10, 2024 18:16
> *To:* AMBER Mailing List <amber.ambermd.org>
> *Subject:* [AMBER] segmentation faults and randomly stopping simulations
>
> Dear Community,
>
> I hope this finds you well. My issue is a follow-up to this post, with a
> few additional steps.
>
>
> http://archive.ambermd.org/202106/0032.html
>
> I am getting this error during minimization:
>
>
> Program received signal SIGSEGV: Segmentation fault - invalid memory
> reference.
>
> Backtrace for this error:
> #0 0x7cfb6b623960 in ???
> #1 0x7cfb6b622ac5 in ???
> #2 0x7cfb6b24251f in ???
> at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
> #3 0x55e1d96b6236 in ???
> #4 0x55e1d96cb224 in ???
> #5 0x55e1d96fff32 in ???
> #6 0x55e1d9730242 in ???
> #7 0x55e1d95fb0be in ???
> #8 0x7cfb6b229d8f in __libc_start_call_main
> at ../sysdeps/nptl/libc_start_call_main.h:58
> #9 0x7cfb6b229e3f in __libc_start_main_impl
> at ../csu/libc-start.c:392
> #10 0x55e1d95fb0f4 in ???
> #11 0xffffffffffffffff in ???
> ./run_simulation_PMEMD_local.sh: line 76: 2140490 Segmentation fault
> (core dumped) ${amberCPU} -O -i ${mini_prefix}.mdin -p ${init}.parm7 -c
> ${init}.rst7 -o ${mini_prefix}.mdout -r ${mini_prefix}.rst7 -inf
> ${mini_prefix}.mdinfo -ref ${init}.rst7
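
The "Segmentation fault (core dumped)" line above is the shell reporting that the process was killed by signal 11 (SIGSEGV); in bash this appears as exit status 139, that is, 128 + 11. A small sketch for logging this per step in a run script, where `describe_exit` is a hypothetical helper and not part of Amber:

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a shell exit status into a readable message.
# Statuses above 128 mean "terminated by signal (status - 128)".
describe_exit() {
  local status=$1
  if [ "$status" -eq 0 ]; then
    echo "ok"
  elif [ "$status" -gt 128 ]; then
    echo "killed by signal $((status - 128))"
  else
    echo "failed with exit code ${status}"
  fi
}

# Usage in a run script, immediately after the engine call:
#   ${amberCPU} -O -i ${mini_prefix}.mdin ... ; describe_exit $?
describe_exit 139
```

Recording the status next to the step name makes it easier to tell a crash (signal) apart from a run that the engine stopped on its own.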
>
> I've received this error with AMBER 24 and AMBER 22 on both a local computer
> and a cluster, with no issues in other small simulations. I have received
> this error very early (NSTEP 100) for a 1.7M-atom system and in a smaller
> 800K-atom system. These systems are a dimer and pentamer of a single ion
> channel that I've run very well for the last 4 years (300K atoms for the
> single). I checked the mdout file and the final section is posted here:
>
>    NSTEP       ENERGY          RMS            GMAX         NAME    NUMBER
>     3300      -2.4631E+06     4.5192E-01     2.2351E+02     C210    202943
>
>  BOND    =    89542.2164  ANGLE   =     76808.1827  DIHED      =   109318.4337
>  UB      =    12522.2692  IMP     =       916.5317  CMAP       =    -1846.3319
>  VDWAALS =   225907.9402  EEL     =  -3020374.2635  HBOND      =        0.0000
>  1-4 VDW =    21258.8594  1-4 EEL =     18791.0156  RESTRAINT  =     4098.4033
>  EAMBER  = -2467155.1464
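
Since the failure point differs between runs, it can help to pull the last reported minimization step and GMAX atom out of each mdout for comparison. A minimal sketch, assuming the header and value lines have the layout quoted above; `last_min_step` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "NSTEP GMAX NAME NUMBER" for the last
# minimization record in an mdout file.
# Assumes each record is a header line starting "NSTEP ENERGY" followed
# directly by one line of values, as in the excerpt above.
last_min_step() {
  awk '$1 == "NSTEP" && $2 == "ENERGY" {
         if ((getline) > 0) { step = $1; gmax = $4; name = $5; num = $6 }
       }
       END { if (step != "") print step, gmax, name, num }' "$1"
}

# Usage: last_min_step ${mini_prefix}.mdout
```

If the same atom number keeps showing up as GMAX just before the crash, that points at a local structural problem rather than a random hardware or memory event.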
>
> As in the post, I checked the bond energy, which is in the 90K range, but the
> single channel is in the 40K range at the same step, so this number makes
> sense for the dimer with more than double the number of atoms. Every time I
> run the system it fails at a different stage, so it seems like a random
> event, except for the 1.7M-atom runs, which fail very quickly. I've also run
> checkstructure with cpptraj (although I never use it, so I could've made a
> mistake) and it seems to pass with no warnings. I'm monitoring my RAM, of
> which I have 128 GB, and usage doesn't seem to exceed that. Is there a number
> in the table above that is excessively high? Do I just need better hardware
> to handle the larger system? I'm using a fairly decent 13th-gen i9 with a
> 4090.
>
> While I'm here, I also have an issue with a single-channel simulation that
> seems to be related. I run my simulations in 1 ns steps, with step6 being
> equilibration and step7.X being the production, where X is the number of ns
> completed. I run this in a loop, with 65 ns being completed at a time. A
> handful of steps end early in the mdout file. However, each writes enough
> data for its files to be used to run the next step. So I'll end up with 65
> steps completed, with steps 3, 4, 17, 34, and 55 ending early. I created
> this simulation by copying a successful simulation and deleting the data to
> do a replicate run. The original ran for 250 ns total with no problems, and
> I've never had this happen in the 3-4 years I've been running this channel.
> Additionally, I've made multiple copies of this simulation and run them,
> only for each to fail at various points, with the first failure occurring
> before 20 ns but at random. I've also had this happen on AMBER 22 and 24 and
> on various computers. It is so strange that a system would run nicely and
> then fail on every other rerun of that system. I found the error quoted
> below, but it isn't helpful to me. I did do some poking about and I think it
> has something to do with CUDA, but since I've been using various AMBER 22
> versions, CUDA versions, and different computers, it might be my system; I
> don't know how to troubleshoot this or what to change.
>
> of length = 42Failed an illegal memory access was encountered
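
One way to find which 1 ns segments ended early is to check each mdout for the timing summary written at the end of a normal run. This sketch assumes a completed run contains a "Total wall time" line, which is worth confirming against a known-good mdout first; `list_incomplete` is a hypothetical helper, and `step7.*.mdout` is only a placeholder for the step7.X naming described above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: list mdout files that lack the final timing summary.
# Assumption: a run that finished normally contains a "Total wall time" line.
list_incomplete() {
  local f
  for f in "$@"; do
    if ! grep -q "Total wall time" "$f"; then
      echo "$f"
    fi
  done
}

# Usage: list_incomplete step7.*.mdout
```

Running this inside the production loop, before starting the next segment, would let the script stop or rerun at the first truncated step instead of continuing from partial output.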
>
> Since these both deal with what appear to be memory violations, I thought
> I'd clump both errors together. Thank you to anyone who can help,
>
> Ryan
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
>
> http://lists.ambermd.org/mailman/listinfo/amber
>
Received on Thu Oct 10 2024 - 16:00:02 PDT