Re: [AMBER] Random segfaults (invalid memory reference) in sander

From: Daniel Roe <daniel.r.roe.gmail.com>
Date: Fri, 3 May 2019 08:27:15 -0400

Hi,

It's certainly possible there's a bug lurking somewhere.
Unfortunately, since it seems to be random, tracking it down may be
tough. If you can, could you compile sander with debug symbols enabled
(configure with '-debug') and re-run a few times? That way, the stack
trace reported after a crash should at least point us to the line in
the source code where things go wrong.
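For reference, a minimal sketch of that workflow; everything here except the '-debug' flag (the compiler/MPI arguments, paths, and the sample address) is an assumption to adapt to your own install:

```shell
# Rebuild sander with debug symbols so crash backtraces show source lines.
# The 'gnu' and '-mpi' arguments are illustrative; use whatever options you
# normally configure with, just with '-debug' added.
cd $AMBERHOME
./configure -debug -mpi gnu
make install

# With a debug build, the '???' frames in the backtrace should resolve to
# file:line automatically. Addresses within the sander executable itself
# can also be resolved by hand with binutils, e.g.:
addr2line -f -e $AMBERHOME/bin/sander.MPI 0x57b8e7
```

(Frames that fall inside shared libraries need their load offset subtracted before addr2line will resolve them, so the automatic trace from the debug build is the easier route.)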

-Dan

On Thu, May 2, 2019 at 4:43 AM Charo del Genio <the.paraw.gmail.com> wrote:
>
> Dear all,
> for a while, I have been experiencing strange crashes with sander. In all cases, I get an error such as
>
> Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
>
>
> Backtrace for this error:
> #0 0x7f5d949cc50f in ???
> #1 0x7f5d89d4e16b in ???
> #2 0x7f5d9444a7fb in ???
> #3 0x7f5d8943162c in ???
> #4 0x7f5d95a0311d in ???
> #5 0x7f5d95cced72 in ???
> #6 0x57b8e7 in ???
> #7 0x560471 in ???
> #8 0x509ddc in ???
> #9 0x4fd944 in ???
> #10 0x4fda62 in ???
> #11 0x7f5d949b8d1c in ???
> #12 0x475a28 in ???
> --------------------------------------------------------------------------
> mpirun noticed that process rank 27 with PID 4386 on node zeus338 exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
>
> After experimenting a bit, here is a list of the symptoms:
>
> 1) The crashes appear to be entirely random, in that the very same run, if repeated, crashes after a different number of steps.
> 2) The crashes seem unrelated to I/O, as the output files, the restart files, and the trajectory files are all good and uncorrupted.
> 3) The crashes only happen when running simulations with explicit solvent; they do not happen when igb>0.
> 4) The crashes are independent of the system being simulated.
> 5) So far, I've only had them in constant-pressure runs, but I wouldn't venture so far as to say that they do not happen in constant-volume jobs.
> 6) The crashes are independent of the barostat used.
> 7) The crashes are independent of the solvation box geometry (rectangular box vs. truncated octahedron).
> 8) So far, I've only experienced them when running parallel jobs, but given their apparent randomness, I don't think that testing with a one-CPU serial job is a viable option.
> 9) They happen with both AmberTools18 and AmberTools19.
>
>
> Just for the sake of completeness, the systems are prepared with leap, using ff14SB and TIP3P water, and my typical mdin looks like this:
>
> &cntrl
> imin=0, ntx=5, irest=1,
> ntpr=500, ntwx=500, ntwr=500,
> ntb=2, cut=8.0, igb=0,
> rgbmax=8.0,
> ntp=2, taup=1.0, barostat=2,
> nstlim=10000000, dt=0.002,
> ntt=3, temp0=295.15, ig=-1, gamma_ln=2.0,
> ntc=2, ntf=2,
> /
>
> Note that, as stated above, some options may be different (ntp=1, barostat=1, presence of ntwprt, etc.).
> In case you wonder, I checked all systems that fail with cpptraj for bad contacts, and didn't find any.
>
>
> One other thing worth mentioning is that I did not experience these problems before July 2018. Since then, the only updates that touched sander were updates 8 and 10, so maybe the problem lies there. Please note,
> I'm not saying there was something wrong with these updates; at this stage, mine is just a vague conjecture.
>
> For compilation, I'm using gcc-7.1.0 with openmpi-2.1.5.
>
>
> Any hints or indications on how to debug this would be greatly appreciated.
>
>
> Cheers,
>
> Charo
>
>
>
>
> --
> Dr. Charo I. del Genio
> Senior Lecturer in Statistical Physics
>
> Applied Mathematics Research Centre (AMRC)
> Design Hub
> Coventry University Technology Park
> Coventry CV1 5FB
> UK
>
> https://charodelgenio.weebly.com
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber

Received on Fri May 03 2019 - 05:30:02 PDT