Re: [AMBER] possible bug - SIGSEGV in cpptraj called from MMPBSA.py

From: Andrzej Dorobisz via AMBER <amber.ambermd.org>
Date: Wed, 14 Dec 2022 10:43:49 +0100

Hi,
Thank you very much for your work and for discovering the corruption in the
trajectory file.
We will investigate what caused it. Please let us know if you
make any fixes in cpptraj.

Best regards,
Andrzej



On 13.12.2022 18:14, Daniel Roe via AMBER wrote:
> Hi,
>
> So the valgrind run finally (!) finished this morning. It looks like
> what is happening is there is a buffer overflow in the ASCII
> trajectory write routine caused by corruption in the original
> trajectory (E81A.nc) - specifically, frames 221453 to 221459. The
> corruption (at least part of it) can be seen with the following
> cpptraj input:
>
> parm ../E81A.top
> trajin ../E81A.nc 221452 221460 1
> vector box out box.dat
>
> which produces:
> #Frame Vec_00001
> 1 83.0102 83.0102 83.0102 0.0000 0.0000 0.0000
> 2 4444084240384.0000 16788542717952.0000 1224259221323776.0000 0.0000 0.0000 0.0000
> 3 100160264732672.0000 11091151159296.0000 24665951043584.0000 0.0000 0.0000 0.0000
> 4 55587589062656.0000 587061497167872.0000 13347418275840.0000 0.0000 0.0000 0.0000
> 5 671196250112.0000 23452807331840.0000 167830309830656.0000 0.0000 0.0000 0.0000
> 6 336910992015360.0000 2308333633536.0000 25847301931008.0000 0.0000 0.0000 0.0000
> 7 15361783103488.0000 2268600991744.0000 10823411957760.0000 0.0000 0.0000 0.0000
> 8 3551720374272.0000 90942996480000.0000 520938127360.0000 0.0000 0.0000 0.0000
> 9 83.0033 83.0033 83.0033 0.0000 0.0000 0.0000
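The box lengths in frames 2-8 above also suggest how an ASCII trajectory writer can overflow. The minimal C++ sketch below assumes each value goes into a fixed 8-character "%8.3f" field (the F8.3 layout used by Amber ASCII trajectories); `formattedWidth` and `formatBoxLine` are hypothetical helpers, not cpptraj's actual write routine. It uses the return value of `std::snprintf` to detect values that would not fit and refuses to write them instead of running past a fixed-size buffer.

```cpp
#include <cstddef>
#include <cstdio>

// Number of characters "%8.3f" produces for `value`. With a null buffer and
// a size of 0, std::snprintf writes nothing and returns the would-be length.
int formattedWidth(double value) {
  return std::snprintf(nullptr, 0, "%8.3f", value);
}

// Hypothetical writer for one box line: three F8.3 fields plus '\n'.
// Returns false (leaving `out` untouched) if the buffer is too small or any
// value needs more than 8 characters, instead of overrunning the buffer.
bool formatBoxLine(const double box[3], char* out, std::size_t outSize) {
  const std::size_t kLineWidth = 3 * 8 + 1;    // three fields + newline
  if (outSize < kLineWidth + 1) return false;  // also need room for '\0'
  for (int i = 0; i < 3; ++i)
    if (formattedWidth(box[i]) > 8) return false;
  std::snprintf(out, outSize, "%8.3f%8.3f%8.3f\n", box[0], box[1], box[2]);
  return true;
}
```

For example, 83.0102 formats to exactly 8 characters, whereas 1224259221323776.0 (the third length in frame 2) needs 20, so `formatBoxLine` rejects that frame rather than writing 12 characters past the end of its field.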
>
> Frames 2-8 clearly have problems with the box lengths. I'm using the
> 'check' command now to check for bad overlaps, stretched bonds, etc.,
> and it seems like there may be some more corruption later in the
> trajectory. Unfortunately the check is slow: the unit cell corruption
> makes the 'check' pair list work improperly (which is also something I
> need to fix), so I need to disable imaging. Right now I would recommend
> using a truncated version of that trajectory (frames 1 to 221452) for
> your analysis. I'll work on fixing the bugs in cpptraj in the meantime
> (even though the trajectory is corrupt, cpptraj should both handle it
> more gracefully and be more informative).
>
> Thanks for the interesting test case! :-)
>
> -Dan
>
> On Fri, Dec 9, 2022 at 10:11 AM Andrzej Dorobisz via AMBER
> <amber.ambermd.org> wrote:
>> Dear Dan,
>> Thank you for investigating this bug. In our core dump we got exactly
>> the same values you pasted here (775042866, 3158064, ... at the beginning
>> of the Selected_ vector in the atom mask object).
>>
>> I hope you will manage to find the cause of this memory corruption.
>>
>> Andrzej
>>
>>
>> On 9.12.2022 14:56, Daniel Roe via AMBER wrote:
>>> OK - so I was able to reproduce the bug, and it does seem like it's a
>>> memory overwrite issue. I'm running an extensive valgrind memcheck to
>>> try to pinpoint the exact cause now.
>>>
>>> What is happening is that the selected atoms array (which contains the
>>> indices of each selected atom) in the atom mask in the RMS action is
>>> being corrupted somehow. Here you can see the first two elements are
>>> clearly incorrect (it should look like 0, 1, 2, 3...):
>>>
>>> (gdb) print tgtMask_.Selected_
>>> $12 = std::vector of length 9280, capacity 16384 = {775042866, 3158064, 2, 3, 4, 5, 6, 7, 8, 9,
>>>
>>> There is almost no way this could happen without some sort of memory
>>> corruption since the routine that sets up the selected array
>>> (Selected_) looks like this (AtomMask.cpp):
>>>
>>>     Selected_.clear();
>>>     for (int atom = 0; atom != Natom_; atom++) {
>>>       if (charmask[atom] == maskChar_)
>>>         Selected_.push_back( atom );
>>>     }
>>>
>>> When subsequent routines try to use this corrupted mask, they hit the
>>> huge first index, which is far out of range for a 9280-atom system;
>>> that is what triggers the segfault that actually stops execution.
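When a mask array has been overwritten like this, a cheap invariant check can turn the segfault into an informative error. The selection loop quoted above can only produce indices that are in range and strictly increasing, so violating either property means the array was corrupted after it was built. The C++ sketch below restates that loop as a free function over `std::vector<char>` and adds such a check; `buildSelected` and `firstInvalidIndex` are hypothetical names, not cpptraj functions.

```cpp
#include <cstddef>
#include <vector>

// Restatement of the AtomMask selection loop: atom i is selected when
// charmask[i] equals maskChar, so indices come out strictly increasing.
std::vector<int> buildSelected(const std::vector<char>& charmask, char maskChar) {
  std::vector<int> selected;
  const int natom = static_cast<int>(charmask.size());
  for (int atom = 0; atom != natom; ++atom)
    if (charmask[atom] == maskChar)
      selected.push_back(atom);
  return selected;
}

// Hypothetical guard: returns the position of the first index that lies
// outside [0, natom) or is not greater than its predecessor, or -1 if the
// array is consistent with what buildSelected can produce.
long firstInvalidIndex(const std::vector<int>& selected, int natom) {
  for (std::size_t i = 0; i < selected.size(); ++i) {
    const int idx = selected[i];
    if (idx < 0 || idx >= natom) return static_cast<long>(i);
    if (i > 0 && idx <= selected[i - 1]) return static_cast<long>(i);
  }
  return -1;
}
```

With the corrupted array from the gdb session above (775042866, 3158064, 2, ...) and 9280 atoms, `firstInvalidIndex` returns 0, pointing at the first overwritten element, so a caller could report the problem before any coordinate access.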
>>>
>>> Unfortunately one of the downsides to valgrind being thorough is that
>>> it is also slow. I've had the run going overnight and nothing has
>>> triggered yet. I'll keep you up to date with what I find.
>>>
>>> -Dan
>>>
>>> On Thu, Dec 8, 2022 at 10:17 AM Daniel Roe <daniel.r.roe.gmail.com> wrote:
>>>> Thanks, I'm downloading it now. I was able to run the given input with
>>>> cpptraj on the shorter trajectory you provided with no issues;
>>>> valgrind showed no memory errors. This is starting to feel like an
>>>> out-of-memory type issue, but I will keep digging.
>>>>
>>>> I'm already seeing some areas where quality-of-life improvements can
>>>> be made to cpptraj (e.g. not every frame needs to be printed to
>>>> stdout for 'onlyframes', etc.).
>>>>
>>>> I'll report when/if I find anything. Thanks for the files.
>>>>
>>>> -Dan
>>>>
>>>> On Thu, Dec 8, 2022 at 8:51 AM Andrzej Dorobisz via AMBER
>>>> <amber.ambermd.org> wrote:
>>>>> Hi,
>>>>> I just uploaded the input data (22 GB) so you can download and test
>>>>> cpptraj on it.
>>>>>
>>>>> - file E81A.nc
>>>>> https://s3.cloud.cyfronet.pl/share/amber-cpptraj-issue/E81A.nc?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=71M2J3OGZ6O5J6K1WAFP%2F20221208%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20221208T134155Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=d01eba1bf5f637ccd55f1f28cdd0623ded099ad95a226a66108f8bc8cc1eeca9
>>>>>
>>>>> - all other files (E81A.top + input-cpptraj.txt)
>>>>> https://s3.cloud.cyfronet.pl/share/amber-cpptraj-issue/cpptraj-SIGSEGV-files.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=71M2J3OGZ6O5J6K1WAFP%2F20221208%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20221208T134129Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=ca2814b488dbe69ed12261ad8b64b057870d155779cba7b147ce9d05af2f7f70
>>>>>
>>>>> The error occurs after about 1 hour and 30 minutes.
>>>>>
>>>>>
>>>>> Regards,
>>>>> Andrzej
>>>>>


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Dec 14 2022 - 02:00:03 PST