Re: [AMBER] Random segfaults (invalid memory reference) in sander

From: James Kress <jimkress_58.kressworks.org>
Date: Thu, 2 May 2019 13:28:48 -0400

Have you tested your RAM? MEMTEST would be a good place to start. In my
experience random segfaults are often attributable to one or more bad DIMMs
of RAM.
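
As a quick first pass, a userspace tool such as memtester can exercise a chunk
of RAM from within the running OS (assuming it is installed on the node); a
full MemTest86+ run from boot media is more thorough. A minimal sketch:

  # Lock and test 4 GB of RAM for 3 passes; adjust the size to the free
  # memory on the node. This is only a quick check, not a substitute for
  # a full boot-time MemTest86+ run.
  sudo memtester 4G 3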

Jim Kress

-----Original Message-----
From: Charo del Genio <the.paraw.gmail.com>
Sent: Thursday, May 02, 2019 4:43 AM
To: amber.ambermd.org
Subject: [AMBER] Random segfaults (invalid memory reference) in sander

Dear all,
        for a while, I have been experiencing strange crashes with sander.
In all cases, I get an error such as

Program received signal SIGSEGV: Segmentation fault - invalid memory
reference.


Backtrace for this error:
#0 0x7f5d949cc50f in ???
#1 0x7f5d89d4e16b in ???
#2 0x7f5d9444a7fb in ???
#3 0x7f5d8943162c in ???
#4 0x7f5d95a0311d in ???
#5 0x7f5d95cced72 in ???
#6 0x57b8e7 in ???
#7 0x560471 in ???
#8 0x509ddc in ???
#9 0x4fd944 in ???
#10 0x4fda62 in ???
#11 0x7f5d949b8d1c in ???
#12 0x475a28 in ???
--------------------------------------------------------------------------
mpirun noticed that process rank 27 with PID 4386 on node zeus338 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
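
For what it's worth, the frames that fall inside the executable itself (the low
addresses such as 0x57b8e7 above) can in principle be translated into function
names and source lines with addr2line, assuming the sander.MPI binary that
produced the trace is still at hand and was built with debug symbols:

  # Resolve one of the in-binary return addresses from the backtrace;
  # adjust the path to the binary actually used. Without -g in the build
  # this will only print '??'.
  addr2line -f -e $AMBERHOME/bin/sander.MPI 0x57b8e7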


After experimenting a bit, here is a list of symptoms:

1) The crashes appear to be entirely random, in that the very same run, if
repeated, crashes after a different number of steps.
2) The crashes seem unrelated to I/O, as the output files, the restart
files, and the trajectory files are all good and uncorrupted.
3) The crashes only happen when running simulations with explicit solvent;
they do not happen when igb>0.
4) The crashes are independent of the system being simulated.
5) So far, I've only had them in constant-pressure runs, but I wouldn't
venture so far as to say that they do not happen in constant-volume jobs.
6) The crashes are independent of the barostat used.
7) The crashes are independent of the solvation box geometry (rectangular box
vs. truncated octahedron).
8) So far, I've only experienced them when running parallel jobs, but given
their apparent randomness, I don't think that testing with a one-cpu serial
job is a viable option.
9) They happen with both AmberTools18 and AmberTools19.


Just for the sake of completeness, the systems are prepared with leap, using
ff14SB and TIP3P water, and my typical mdin looks like this:

  &cntrl
   imin=0, ntx=5, irest=1,
   ntpr=500, ntwx=500, ntwr=500,
   ntb=2, cut=8.0, igb=0,
   rgbmax=8.0,
   ntp=2, taup=1.0, barostat=2,
   nstlim=10000000, dt=0.002,
   ntt=3, temp0=295.15, ig=-1, gamma_ln=2.0,
   ntc=2, ntf=2,
  /

Note that, as stated above, some options may be different (ntp=1,
barostat=1, presence of ntwprt, etc.).
In case you are wondering, I used cpptraj to check all the failing systems for
bad contacts, and didn't find any.
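
For completeness, a typical launch of these jobs would look roughly like the
sketch below; the file names are placeholders, and the "rank 27" in the error
above implies at least 28 MPI processes:

  # Hypothetical launch line: -O overwrites existing output; -i/-o/-p/-c/-r/-x
  # are the usual input, output, topology, coordinate, restart and trajectory
  # files.
  mpirun -np 28 $AMBERHOME/bin/sander.MPI -O -i mdin -o mdout \
         -p prmtop -c inpcrd -r restrt -x mdcrd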


One other thing worth mentioning is that up to July 2018 I did not experience
these problems. Since then, the only updates that touched sander were updates 8
and 10, so maybe the problem lies there. Please note, I'm not saying there was
something wrong with these updates; at this stage, mine is just a vague
conjecture.

For compilation, I'm using gcc-7.1.0 with openmpi-2.1.5.
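
One generic approach would presumably be to rebuild sander.MPI with debug
symbols (-g) and inspect a core dump from the failing rank, along the lines of
the sketch below (assuming core dumps are permitted on the compute nodes):

  # Allow core files before launching the job; the core file name depends
  # on the node's kernel.core_pattern setting.
  ulimit -c unlimited
  # After a crash, load the core from the failing rank into gdb;
  # 4386 is the PID reported by mpirun above.
  gdb $AMBERHOME/bin/sander.MPI core.4386
  # Inside gdb, 'bt' prints the backtrace with function names and lines.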


Any hints or indications on how to debug this would be greatly appreciated.


Cheers,

Charo




--
Dr. Charo I. del Genio
Senior Lecturer in Statistical Physics
Applied Mathematics Research Centre (AMRC)
Design Hub, Coventry University Technology Park
Coventry CV1 5FB, UK
https://charodelgenio.weebly.com
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu May 02 2019 - 10:30:02 PDT