Re: [AMBER] possible bug - SIGSEGV in cpptraj called from MMPBSA.py from Daniel Roe via AMBER on 2022-12-16 (Amber Archive Dec 2022)

From: Daniel Roe via AMBER <amber.ambermd.org>
Date: Fri, 16 Dec 2022 13:00:15 -0500

Hi,

I've made a couple of updates to cpptraj that will improve handling of
corrupted trajectories
(https://github.com/Amber-MD/cpptraj/pull/1009). The upshot is you can
use the 'check' action with the 'skipbadframes' option to filter out
the corrupted frames:

parm myparm.top
trajin corrupted.nc
check skipbadframes reportfile report.dat
trajout ok.nc
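
For illustration only, here is a minimal C++ sketch of the kind of sanity
test that separates sane frames from corrupted ones, using the box lengths
as the criterion. This is not how the 'check' action is implemented in
cpptraj; the names BoxRecord, isPlausibleBox and findBadFrames, and the
1000 Angstrom cutoff, are assumptions made up for this example.

```cpp
#include <cmath>
#include <vector>

// Hypothetical record for one frame: frame number and box lengths (Angstroms).
struct BoxRecord {
  int frame;
  double x, y, z;
};

// Return true if all three box lengths are finite, positive, and no larger
// than maxLength. The default cutoff is an arbitrary assumption.
bool isPlausibleBox(const BoxRecord& box, double maxLength = 1000.0) {
  const double lengths[3] = {box.x, box.y, box.z};
  for (double len : lengths) {
    if (!std::isfinite(len) || len <= 0.0 || len > maxLength)
      return false;
  }
  return true;
}

// Collect the frame numbers of all records that fail the plausibility test.
std::vector<int> findBadFrames(const std::vector<BoxRecord>& boxes,
                               double maxLength = 1000.0) {
  std::vector<int> bad;
  for (const BoxRecord& box : boxes) {
    if (!isPlausibleBox(box, maxLength))
      bad.push_back(box.frame);
  }
  return bad;
}
```

Fed the box lengths from the 'vector box' output in the quoted message
below, a test like this would keep the frames with lengths near 83 and
flag the ones with lengths in the 10^11 to 10^15 range.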

The OMP version should provide some speedup for the 'check' action.
Hope this helps,

-Dan

On Wed, Dec 14, 2022 at 4:44 AM Andrzej Dorobisz via AMBER
<amber.ambermd.org> wrote:
>
> Hi,
> Thank you very much for your work and for discovering the corruption in
> the trajectory file.
> We will investigate what caused this. Please let us know if you
> make any fixes to cpptraj.
>
> Best regards,
> Andrzej
>
>
>
> On 13.12.2022 18:14, Daniel Roe via AMBER wrote:
> > Hi,
> >
> > So the valgrind run finally (!) finished this morning. It looks like
> > what is happening is there is a buffer overflow in the ASCII
> > trajectory write routine caused by corruption in the original
> > trajectory (E81A.nc) - specifically, frames 221453 to 221459. The
> > corruption (at least part of it) can be seen with the following
> > cpptraj input:
> >
> > parm ../E81A.top
> > trajin ../E81A.nc 221452 221460 1
> > vector box out box.dat
> >
> > which produces:
> > #Frame Vec_00001
> > 1 83.0102 83.0102 83.0102 0.0000 0.0000 0.0000
> > 2 4444084240384.0000 16788542717952.0000 1224259221323776.0000 0.0000 0.0000 0.0000
> > 3 100160264732672.0000 11091151159296.0000 24665951043584.0000 0.0000 0.0000 0.0000
> > 4 55587589062656.0000 587061497167872.0000 13347418275840.0000 0.0000 0.0000 0.0000
> > 5 671196250112.0000 23452807331840.0000 167830309830656.0000 0.0000 0.0000 0.0000
> > 6 336910992015360.0000 2308333633536.0000 25847301931008.0000 0.0000 0.0000 0.0000
> > 7 15361783103488.0000 2268600991744.0000 10823411957760.0000 0.0000 0.0000 0.0000
> > 8 3551720374272.0000 90942996480000.0000 520938127360.0000 0.0000 0.0000 0.0000
> > 9 83.0033 83.0033 83.0033 0.0000 0.0000 0.0000
> >
> > Frames 2-8 clearly have problems with the box lengths. I'm using the
> > 'check' command now to check for bad overlaps, stretched bonds, etc.,
> > and it seems like there may be some more corruption later in the
> > trajectory. Unfortunately the check is slow; the unit cell corruption
> > makes the 'check' pair list work improperly (which is also something I
> > need to fix) so I need to disable imaging. Right now I would recommend
> > using a truncated version of that trajectory (frames 1 to 221452) for
> > your analysis. I'll work on fixing the bugs in cpptraj in the meantime
> > (even though the trajectory is corrupt, cpptraj should both handle it
> > more gracefully and be more informative).
> >
> > Thanks for the interesting test case! :-)
> >
> > -Dan
> >
> > On Fri, Dec 9, 2022 at 10:11 AM Andrzej Dorobisz via AMBER
> > <amber.ambermd.org> wrote:
> >> Dear Dan,
> >> Thank you for investigating this bug. In our core dump we got exactly
> >> the same values you pasted here (775042866, 3158064, ... at the beginning
> >> of the Selected_ vector in the atom mask object).
> >>
> >> I hope you will manage to find the cause of this memory corruption.
> >>
> >> Andrzej
> >>
> >>
> >> On 9.12.2022 14:56, Daniel Roe via AMBER wrote:
> >>> OK - so I was able to reproduce the bug, and it does seem like it's a
> >>> memory overwrite issue. I'm running an extensive valgrind memcheck to
> >>> try to pinpoint the exact cause now.
> >>>
> >>> What is happening is that the selected atoms array (which contains the
> >>> indices of each selected atom) in the atom mask in the RMS action is
> >>> being corrupted somehow. Here you can see the first two elements are
> >>> clearly incorrect (it should look like 0, 1, 2, 3...):
> >>>
> >>> (gdb) print tgtMask_.Selected_
> >>> $12 = std::vector of length 9280, capacity 16384 = {775042866,
> >>> 3158064, 2, 3, 4, 5, 6, 7, 8, 9,
> >>>
> >>> There is almost no way this could happen without some sort of memory
> >>> corruption since the routine that sets up the selected array
> >>> (Selected_) looks like this (AtomMask.cpp):
> >>>
> >>> Selected_.clear();
> >>> for (int atom = 0; atom != Natom_; atom++) {
> >>>   if (charmask[atom] == maskChar_)
> >>>     Selected_.push_back( atom );
> >>> }
> >>>
> >>> When subsequent routines try to use this corrupted mask, they hit the
> >>> huge first index, which is far out of range for a 9280-atom system,
> >>> and that is what triggers the segfault that actually stops execution.
> >>>
> >>> Unfortunately one of the downsides to valgrind being thorough is that
> >>> it is also slow. I've had the run going overnight and nothing has
> >>> triggered yet. I'll keep you up to date with what I find.
> >>>
> >>> -Dan
> >>>
> >>> On Thu, Dec 8, 2022 at 10:17 AM Daniel Roe <daniel.r.roe.gmail.com> wrote:
> >>>> Thanks, I'm downloading it now. I was able to run the given input with
> >>>> cpptraj on the shorter trajectory you provided with no issues;
> >>>> valgrind showed no memory errors. This is starting to feel like an
> >>>> out-of-memory type issue, but I will keep digging.
> >>>>
> >>>> I'm already seeing some areas where quality of life improvements can
> >>>> be made to cpptraj (e.g. every frame does not need to be printed to
> >>>> stdout for 'onlyframes' etc).
> >>>>
> >>>> I'll report when/if I find anything. Thanks for the files.
> >>>>
> >>>> -Dan
> >>>>
> >>>> On Thu, Dec 8, 2022 at 8:51 AM Andrzej Dorobisz via AMBER
> >>>> <amber.ambermd.org> wrote:
> >>>>> Hi,
> >>>>> I just uploaded the input data (22 GB) so you can download and test
> >>>>> cpptraj on it.
> >>>>>
> >>>>> - file E81A.nc
> >>>>> https://s3.cloud.cyfronet.pl/share/amber-cpptraj-issue/E81A.nc?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=71M2J3OGZ6O5J6K1WAFP%2F20221208%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20221208T134155Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=d01eba1bf5f637ccd55f1f28cdd0623ded099ad95a226a66108f8bc8cc1eeca9
> >>>>>
> >>>>> - all other files (E81A.top + input-cpptraj.txt)
> >>>>> https://s3.cloud.cyfronet.pl/share/amber-cpptraj-issue/cpptraj-SIGSEGV-files.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=71M2J3OGZ6O5J6K1WAFP%2F20221208%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20221208T134129Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=ca2814b488dbe69ed12261ad8b64b057870d155779cba7b147ce9d05af2f7f70
> >>>>>
> >>>>> The error occurs after about 1 hour and 30 minutes.
> >>>>>
> >>>>>
> >>>>> Regards,
> >>>>> Andrzej
> >>>>>
>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber

Received on Fri Dec 16 2022 - 10:30:02 PST