Re: [AMBER] possible bug - SIGSEGV in cpptraj called from MMPBSA.py

From: Daniel Roe via AMBER <amber.ambermd.org>
Date: Tue, 13 Dec 2022 12:14:11 -0500

Hi,

So the valgrind run finally (!) finished this morning. It looks like
there is a buffer overflow in the ASCII trajectory write routine,
caused by corruption in the original trajectory (E81A.nc),
specifically in frames 221453 to 221459. The corruption (at least
part of it) can be seen with the following cpptraj input:

parm ../E81A.top
trajin ../E81A.nc 221452 221460 1
vector box out box.dat

which produces:
#Frame Vec_00001
       1 83.0102 83.0102 83.0102 0.0000 0.0000 0.0000
       2 4444084240384.0000 16788542717952.0000 1224259221323776.0000 0.0000 0.0000 0.0000
       3 100160264732672.0000 11091151159296.0000 24665951043584.0000 0.0000 0.0000 0.0000
       4 55587589062656.0000 587061497167872.0000 13347418275840.0000 0.0000 0.0000 0.0000
       5 671196250112.0000 23452807331840.0000 167830309830656.0000 0.0000 0.0000 0.0000
       6 336910992015360.0000 2308333633536.0000 25847301931008.0000 0.0000 0.0000 0.0000
       7 15361783103488.0000 2268600991744.0000 10823411957760.0000 0.0000 0.0000 0.0000
       8 3551720374272.0000 90942996480000.0000 520938127360.0000 0.0000 0.0000 0.0000
       9 83.0033 83.0033 83.0033 0.0000 0.0000 0.0000

Frames 2-8 clearly have problems with the box lengths. I'm using the
'check' command now to check for bad overlaps, stretched bonds, etc.,
and it seems like there may be some more corruption later in the
trajectory. Unfortunately the check is slow; the unit cell corruption
makes the 'check' pair list work improperly (which is also something I
need to fix) so I need to disable imaging. Right now I would recommend
using a truncated version of that trajectory (frames 1 to 221452) for
your analysis. I'll work on fixing the bugs in cpptraj in the meantime
(even though the trajectory is corrupt, cpptraj should both handle it
more gracefully and be more informative).
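
As a rough illustration of the kind of sanity check that would flag
these frames, here is a minimal C++ sketch. It is not cpptraj code:
the helper names (plausibleLength, findBadBoxFrames) and the upper
bound passed as maxLen are assumptions made for this example. It
expects text laid out like box.dat, with one frame per line (frame
number followed by the three box lengths) and '#' starting a comment
line.

```cpp
#include <cmath>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of cpptraj): a box length is treated as
// plausible if it is finite, positive, and no larger than an assumed bound.
bool plausibleLength(double len, double maxLen) {
    return std::isfinite(len) && len > 0.0 && len <= maxLen;
}

// Scan text laid out like box.dat ("frame x y z ..." on each line, with '#'
// starting a comment line) and return the frame numbers whose box lengths
// fail the plausibility test.
std::vector<int> findBadBoxFrames(const std::string& text, double maxLen) {
    std::vector<int> bad;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;
        std::istringstream fields(line);
        int frame;
        double x, y, z;
        // Lines that do not start with "int double double double" are skipped.
        if (!(fields >> frame >> x >> y >> z)) continue;
        if (!plausibleLength(x, maxLen) || !plausibleLength(y, maxLen) ||
            !plausibleLength(z, maxLen)) {
            bad.push_back(frame);
        }
    }
    return bad;
}
```

With an assumed bound of 1000 Angstroms, calling
findBadBoxFrames(contents, 1000.0) on the contents of box.dat would be
expected to report frames 2 through 8 and leave frames 1 and 9 alone.
The bound is arbitrary; anything comfortably above the ~83 Angstrom
box of this system and far below the corrupted values would do.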

Thanks for the interesting test case! :-)

-Dan

On Fri, Dec 9, 2022 at 10:11 AM Andrzej Dorobisz via AMBER
<amber.ambermd.org> wrote:
>
> Dear Dan,
> Thank you for investigating this bug. In our core dump we got exactly
> the same values you pasted here (775042866, 3158064, ... at the beginning
> of the Selected_ vector in atom mask object).
>
> I hope you will manage to find the cause of this memory corruption.
>
> Andrzej
>
>
> On 9.12.2022 14:56, Daniel Roe via AMBER wrote:
> > OK - so I was able to reproduce the bug, and it does seem like it's a
> > memory overwrite issue. I'm running an extensive valgrind memcheck to
> > try to pinpoint the exact cause now.
> >
> > What is happening is that the selected atoms array (which contains the
> > indices of each selected atom) in the atom mask in the RMS action is
> > being corrupted somehow. Here you can see the first two elements are
> > clearly incorrect (it should look like 0, 1, 2, 3...):
> >
> > (gdb) print tgtMask_.Selected_
> > $12 = std::vector of length 9280, capacity 16384 = {775042866,
> > 3158064, 2, 3, 4, 5, 6, 7, 8, 9,
> >
> > There is almost no way this could happen without some sort of memory
> > corruption since the routine that sets up the selected array
> > (Selected_) looks like this (AtomMask.cpp):
> >
> > Selected_.clear();
> > for (int atom = 0; atom != Natom_; atom++) {
> >   if (charmask[atom] == maskChar_)
> >     Selected_.push_back( atom );
> > }
> >
> > When subsequent routines try to use this corrupted mask they hit the
> > huge first index which is way out of range (in a 9280 atom system)
> > which is what triggers the segfault that actually stops execution.
> >
> > Unfortunately one of the downsides to valgrind being thorough is that
> > it is also slow. I've had the run going overnight and nothing has
> > triggered yet. I'll keep you up to date with what I find.
> >
> > -Dan
> >
> > On Thu, Dec 8, 2022 at 10:17 AM Daniel Roe <daniel.r.roe.gmail.com> wrote:
> >> Thanks, I'm downloading it now. I was able to run the given input with
> >> cpptraj on the shorter trajectory you provided with no issues;
> >> valgrind showed no memory errors. This is starting to feel like an
> >> out-of-memory type issue, but I will keep digging.
> >>
> >> I'm already seeing some areas where quality of life improvements can
> >> be made to cpptraj (e.g. every frame does not need to be printed to
> >> stdout for 'onlyframes' etc).
> >>
> >> I'll report when/if I find anything. Thanks for the files.
> >>
> >> -Dan
> >>
> >> On Thu, Dec 8, 2022 at 8:51 AM Andrzej Dorobisz via AMBER
> >> <amber.ambermd.org> wrote:
> >>> Hi,
> >>> I just uploaded the input data (22 GB) so you can download and test
> >>> cpptraj on it.
> >>>
> >>> - file E81A.nc
> >>> https://s3.cloud.cyfronet.pl/share/amber-cpptraj-issue/E81A.nc?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=71M2J3OGZ6O5J6K1WAFP%2F20221208%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20221208T134155Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=d01eba1bf5f637ccd55f1f28cdd0623ded099ad95a226a66108f8bc8cc1eeca9
> >>>
> >>> - all other files (E81A.top + input-cpptraj.txt)
> >>> https://s3.cloud.cyfronet.pl/share/amber-cpptraj-issue/cpptraj-SIGSEGV-files.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=71M2J3OGZ6O5J6K1WAFP%2F20221208%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20221208T134129Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=ca2814b488dbe69ed12261ad8b64b057870d155779cba7b147ce9d05af2f7f70
> >>>
> >>> The error occurs after about 1 hour and 30 minutes.
> >>>
> >>>
> >>> Regards,
> >>> Andrzej
> >>>
> >>>
> >>> On 6.12.2022 16:43, Daniel Roe via AMBER wrote:
> >>>> Hi,
> >>>>
> >>>> On Tue, Dec 6, 2022 at 10:21 AM Andrzej Dorobisz
> >>>> <andrzej.dorobisz.cyfronet.krakow.pl> wrote:
> >>>>> Thank you for the quick reply. I can send the topology file but I don't
> >>>>> know how to extract frames from the trajectory file (../input/E81A.nc).
> >>>> The input would be something like:
> >>>>
> >>>> parm E81A.top
> >>>> trajin E81A.nc 1 10
> >>>> trajout E81A.1-10.nc
> >>>>
> >>>> -Dan
> >>>>
> >>>> _______________________________________________
> >>>> AMBER mailing list
> >>>> AMBER.ambermd.org
> >>>> http://lists.ambermd.org/mailman/listinfo/amber
>
>

Received on Tue Dec 13 2022 - 09:30:02 PST