Re: [AMBER] [EXTERNAL] Re: Amber22/AmberTools23: Enabling of libtorch & cudnn libraries breaks pbsa binaries

From: Chakrabarti, Mayukh (NIH/NCI) [C] via AMBER <amber.ambermd.org>
Date: Tue, 14 May 2024 22:34:07 +0000

Hi Yongxian,

I wanted to briefly follow up on my previous message with the result of some additional testing. I compiled Amber24 with Libtorch 2.1.0 and CUDA 12.0 with the standard ‘release build’, without using the ‘-DCMAKE_BUILD_TYPE=Debug’ flag. Interestingly, in this build, all of the PBSA tests run successfully, even including the ‘Run.dmp.sasopt2’ PBSA test that fails using CUDA 11.8 and CUDA 11.7, as you reported in your earlier message:

cd pbsa_dmp && ./test
working on ./Run.dmp.sasopt0
diffing mdout.dmp.min_0.save with mdout.dmp.min_0
PASSED
==============================================================
working on ./Run.dmp.sasopt1
diffing mdout.dmp.min_1.save with mdout.dmp.min_1
PASSED
==============================================================
working on ./Run.dmp.sasopt2
diffing mdout.dmp.min_2.save with mdout.dmp.min_2
PASSED
==============================================================

This suggests that the SIGSEGV error may at least be related to the version of CUDA used, if not Libtorch, since the same PBSA test that fails using Libtorch 2.1.0 and CUDA 11.8 passes when using Libtorch 2.1.0 and CUDA 12.0. Please let me know if you are able to replicate this behavior if you get an opportunity to compile Amber24 with Libtorch 2.1.0 and CUDA 12.0.
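
For anyone comparing builds side by side, a tally of the test output can make differences like this easier to spot. The following is a minimal shell sketch, not part of the Amber test suite. The function name tally_pbsa_log is hypothetical, and it assumes the log contains the exact "PASSED" and "Program error" lines shown in this thread:

```shell
# Hypothetical helper: count passing and crashing tests in pbsa test output.
# Assumes a pass is a line that is exactly "PASSED" and a crash is reported
# on a line ending in ": Program error", as in the logs quoted in this thread.
tally_pbsa_log() {
  awk '
    /^PASSED$/         { pass++ }
    /: Program error$/ { fail++ }
    END { printf "passed=%d failed=%d\n", pass, fail }
  '
}

# Example use (run from the directory that contains pbsa_dmp):
#   (cd pbsa_dmp && ./test) 2>&1 | tally_pbsa_log
```

Running the same pipeline against each build (for example CUDA 11.8 and CUDA 12.0) gives one line per build that can be compared directly.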

Best,

Mayukh Chakrabarti (he/him)
COMPUTATIONAL SCIENTIST

From: Chakrabarti, Mayukh (NIH/NCI) [C] via AMBER <amber.ambermd.org>
Date: Monday, May 13, 2024 at 5:39 PM
To: Yongxian Wu <yongxian.wu.uci.edu>
Cc: Ray Luo <rluo.uci.edu>, Chakrabarti, Mayukh (NIH/NCI) [C] via AMBER <amber.ambermd.org>
Subject: Re: [AMBER] [EXTERNAL] Re: Amber22/AmberTools23: Enabling of libtorch & cudnn libraries breaks pbsa binaries
Hi Yongxian,

Thank you very much for looking into this further. Based on your advice, I re-compiled Amber24 with debugging symbols, using my existing Torch 2.1.0 with CUDA 11.8 setup. Miraculously, upon compiling with the ‘-DCMAKE_BUILD_TYPE=Debug’ flag, I am able to replicate your testing result exactly. I did not make any changes to my ‘run_cmake’ compilation script other than the addition of this build flag. The PBSA tests that previously failed are now passing, and I encounter the same memory error that you encounter with Run.dmp.sasopt2:

working on ./Run.dmp.sasopt2

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:

Could not print backtrace: unrecognized DWARF version in .debug_info at 6
Could not print backtrace: unrecognized DWARF version in .debug_info at 6
Could not print backtrace: unrecognized DWARF version in .debug_info at 6
Could not print backtrace: unrecognized DWARF version in .debug_info at 6
#0 0x7f242a5e8171 in ???
#1 0x7f242a5e7313 in ???
#2 0x7f242a06fc0f in ???
#3 0x452554 in pb_lpbene
                at <install-dir> /amber24_src/AmberTools/src/pbsa/pb_fdfrc.F90:1296
#4 0x452554 in pb_fdfrc_
                at <install-dir> /amber24_src/AmberTools/src/pbsa/pb_fdfrc.F90:504
#5 0x434aaa in __poisson_boltzmann_MOD_pb_force
                at <install-dir> /amber24_src/AmberTools/src/pbsa/pb_force.F90:438
#6 0x420499 in force_
                at <install-dir> /amber24_src/AmberTools/src/pbsa/force.F90:168
#7 0x41fe50 in runmin_
                at <install-dir> /amber24_src/AmberTools/src/pbsa/runmin.F90:155
#8 0x40cd28 in pbsa_
                at <install-dir> /amber24_src/AmberTools/src/pbsa/pbsa.F90:195
#9 0x40eb4f in pbsamain
                at <install-dir> /amber24_src/AmberTools/src/pbsa/pbsa.F90:26
#10 0x40eb95 in main
                at <install-dir> /amber24_src/AmberTools/src/pbsa/pbsa.F90:30
Segmentation fault (core dumped)
  ./Run.dmp.sasopt2: Program error

This makes me wonder, is there some difference in the compilation of the ‘Debug’ build vs. the default ‘release’ build that could be causing the PBSA tests to fail in the ‘release’ build, and not in the ‘Debug’ build?

Out of curiosity, I also went back and re-compiled my version of Amber24 with Libtorch 1.12.1 and CUDA 10.2 using the ‘-DCMAKE_BUILD_TYPE=Debug’ flag, but doing this did not fix the SIGSEGV issues in the PBSA tests, nor did it allow me to obtain a readable backtrace (despite the Debug build).

Best,

Mayukh Chakrabarti (he/him)
COMPUTATIONAL SCIENTIST

From: Yongxian Wu <yongxian.wu.uci.edu>
Date: Saturday, May 11, 2024 at 10:47 PM
To: Chakrabarti, Mayukh (NIH/NCI) [C] <mayukh.chakrabarti.nih.gov>
Cc: Chakrabarti, Mayukh (NIH/NCI) [C] via AMBER <amber.ambermd.org>, Ray Luo <rluo.uci.edu>
Subject: Re: [AMBER] [EXTERNAL] Re: Amber22/AmberTools23: Enabling of libtorch & cudnn libraries breaks pbsa binaries

Hi Mayukh,

I tested Amber 24 with Libtorch 2.0.1 and CUDA 11.7, and I could not reproduce the error you encountered. In my tests, both pbsa_bcopt and pbsa_saopt passed without any segmentation errors. Here is a screenshot of my testing results.
[inline screenshot not preserved in the list archive]

However, I did find a memory issue with pbsa, although it seems unrelated to Libtorch. When I run multiple pbsa testing processes simultaneously, the pbsa_dmp test case fails due to an invalid memory reference.
[inline screenshot not preserved in the list archive]

I attempted to compile Amber with debugging symbols to obtain readable backtrace information. The error message appears as follows:
[inline screenshot not preserved in the list archive]

Based on my initial investigation, I believe this problem is caused by a Fortran memory issue and occurs only when multiple tests are run simultaneously; otherwise, this test case would pass smoothly without any errors.

To solve your problem, I suggest the following steps:

  1. Please align your environment settings with mine to avoid any unexpected errors. You can try using Libtorch 2.0.1 + CUDA 11.7, as I have already tested it. If you have CUDA 11.8 installed, it should be quite straightforward to switch to CUDA 11.7.
  2. Since the backtrace information attached in your email is unreadable, you might consider recompiling Amber 24 with debugging symbols to get a readable error message. To do this, simply add -DCMAKE_BUILD_TYPE=Debug to your cmake command.
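
As a rough illustration of step 2, the sketch below assembles cmake arguments for a Debug build with LibTorch enabled. The paths in AMBER_SRC and TORCH_HOME are placeholders (assumptions, not values from this thread), the function name debug_cmake_args is hypothetical, and any other flags from an existing run_cmake script would still need to be carried over:

```shell
# Placeholder paths (assumptions); override them in the environment.
AMBER_SRC="${AMBER_SRC:-/path/to/amber24_src}"
TORCH_HOME="${TORCH_HOME:-/path/to/libtorch}"

# Hypothetical helper: print one cmake argument per line so the list
# is easy to inspect or diff against a release configuration.
debug_cmake_args() {
  printf '%s\n' \
    "$AMBER_SRC" \
    "-DCMAKE_BUILD_TYPE=Debug" \
    "-DLIBTORCH=ON" \
    "-DTORCH_HOME=$TORCH_HOME"
}

# Print the command instead of running it; drop "echo" to configure for real.
# Note: the unquoted substitution assumes the paths contain no spaces.
echo cmake $(debug_cmake_args)
```

Printing the command first makes it easy to confirm that the Debug flag is the only change relative to the release configuration.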


Let me know if you need further assistance.

Best,

Yongxian Wu

On Fri, May 10, 2024 at 7:07 AM Chakrabarti, Mayukh (NIH/NCI) [C] <mayukh.chakrabarti.nih.gov> wrote:
Hi Yongxian,

Thank you for testing this. As mentioned in my earlier message, I re-tested by compiling a version of Amber24 with PyTorch 1.12.1 and CUDA 10.2. I used ‘libtorch-shared-with-deps-1.12.1+cu102.zip’, which corresponds to that CUDA version. As you mentioned, the compilation and build process proceeds without any issues. However, although ‘make test.serial’ runs to completion and reports the final testing statistics, it still reports errors when testing the pbsa binaries, e.g.:

cd pbsa_bcopt && ./Run.dmp.min

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x7ff0f7ca0171 in ???
#1 0x7ff0f7c9f313 in ???
#2 0x7ff0f735cc0f in ???
#3 0x7ff15479a219 in ???
#4 0x7ff0f735f856 in ???
#5 0x7ff15477b722 in ???
Segmentation fault (core dumped)
  ./Run.dmp.min: Program error
make[2]: [Makefile:199: test.pbsa] Error 1 (ignored)

cd pbsa_saopt && ./Run.dmp.min

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x7ff58f59f171 in ???
#1 0x7ff58f59e313 in ???
#2 0x7ff58ec5bc0f in ???
#3 0x7ff5ec099219 in ???
#4 0x7ff58ec5e856 in ???
#5 0x7ff5ec07a722 in ???
Segmentation fault (core dumped)
  ./Run.dmp.min: Program error

I ran ‘ldd’ on both the pbsa and pbsa.cuda binaries, and there don’t appear to be any missing linked libraries. You mentioned that “Libtorch should still function even if you encounter errors with pbsa_test”. Does this mean it is safe to ignore these SIGSEGV errors, assuming that libtorch is otherwise properly installed and corresponds to the CUDA version being used? Is it some issue with the test itself?
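
For reference, the ‘ldd’ check described above can be scripted roughly as follows. This is a sketch under assumptions: the function name missing_libs is hypothetical, and the binaries are assumed to live under $AMBERHOME/bin after sourcing amber.sh:

```shell
# Hypothetical helper: read ldd output on stdin and print the names of
# libraries that the dynamic loader could not resolve.
missing_libs() {
  awk '/=> not found/ { print $1 }'
}

# Check both binaries mentioned in this thread, if they are present.
if [ -n "${AMBERHOME:-}" ]; then
  for bin in pbsa pbsa.cuda; do
    if [ -x "$AMBERHOME/bin/$bin" ]; then
      echo "== $bin"
      ldd "$AMBERHOME/bin/$bin" | missing_libs
    fi
  done
fi
```

An empty list only shows that every library resolves. It does not show that the resolved versions are compatible with each other.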

Best,

Mayukh Chakrabarti (he/him)
COMPUTATIONAL SCIENTIST

From: Yongxian Wu <yongxian.wu.uci.edu>
Date: Friday, May 10, 2024 at 1:56 AM
To: Chakrabarti, Mayukh (NIH/NCI) [C] <mayukh.chakrabarti.nih.gov>, AMBER Mailing List <amber.ambermd.org>
Cc: Ray Luo <rluo.uci.edu>
Subject: Re: [AMBER] [EXTERNAL] Re: Amber22/AmberTools23: Enabling of libtorch & cudnn libraries breaks pbsa binaries
Hi Mayukh,

I have tested compiling Amber24 with CUDA 11.7.0 and PyTorch 2.0.1. The compilation and build process proceeded without any issues, and make test.serial passed without any stopping errors. Please ensure you use the version of libtorch that corresponds to the specific CUDA version. However, note that the pbsa_test did not involve libtorch or MLSES. Libtorch should still function even if you encounter errors with pbsa_test. Here are the specific variables I used:


    -DLIBTORCH=ON \
    -DCUDA_HOME=$CUDA_PATH/cuda_11.7.0 \
    -DTORCH_HOME=$LIBTORCH_PATH/libtorch \
    -DCUDA_TOOLKIT_ROOT_DIR=/$CUDA_PATH/cuda_11.7.0 \
    -DCUDNN=TRUE \
    -DCUDNN_INCLUDE_PATH=$CUDNN_PATH/include \
    -DCUDNN_LIBRARY_PATH=$CUDNN_PATH/lib/libcudnn.so \

Best regards,

Yongxian Wu

On Wed, May 8, 2024 at 2:26 PM Chakrabarti, Mayukh (NIH/NCI) [C] via AMBER <amber.ambermd.org> wrote:
Hi Ray and Yongxian,

Thank you for your response. I don’t have CUDA 11.3 available on my system, but I do have CUDA 11.8, which I tried to use instead. Upon your advice, I tried to re-compile Amber 24 & AmberTools 24 using an updated version of libtorch, version 2.1.0 with CUDA 11.8. I also used “cudnn-linux-x86_64-8.7.0.84_cuda11-archive.tar.xz”. Unfortunately, following compilation, I still encounter the same segmentation fault errors (SIGSEGV) in the pbsa binaries. I will next try with PyTorch 1.12.1 and CUDA 10.2, and report back.

Best,

Mayukh Chakrabarti (he/him)
COMPUTATIONAL SCIENTIST

From: Ray Luo <rluo.uci.edu>
Date: Wednesday, May 8, 2024 at 1:04 PM
To: Chakrabarti, Mayukh (NIH/NCI) [C] <mayukh.chakrabarti.nih.gov>, AMBER Mailing List <amber.ambermd.org>
Subject: [EXTERNAL] Re: [AMBER] Amber22/AmberTools23: Enabling of libtorch & cudnn libraries breaks pbsa binaries
Hi Mayukh,

The issue might be due to the version of PyTorch being used.

I recommend using the same version of libtorch that we currently have, which is version 1.12.1 with CUDA 11.3. PyTorch 1.9 with CUDA 10.2 is outdated. However, PyTorch 1.12.1 also supports CUDA 10.2, although we recommend using CUDA 11 since CUDA 10 is outdated.

Let me know if there are any other questions.

Best,
Yongxian and Ray
--
Ray Luo, Ph.D.
Professor of Structural Biology/Biochemistry/Biophysics,
Chemical and Materials Physics, Chemical and Biomolecular Engineering,
Biomedical Engineering, and Materials Science and Engineering
Department of Molecular Biology and Biochemistry
University of California, Irvine, CA 92697-3900


On Tue, May 7, 2024 at 8:57 AM Chakrabarti, Mayukh (NIH/NCI) [C] via AMBER <amber.ambermd.org> wrote:
Hello,

I wanted to update my report below to mention that the issue with pbsa binaries breaking upon enabling the LibTorch & cudnn libraries still persists in Amber24 with AmberTools 24. I compiled with CUDA 10.2, gcc 8.5, OpenMPI 4.1.5, python 3.10, and cmake 3.25.2 on a Red Hat Enterprise Linux release 8.8 (Ootpa) system. I am not aware of any workaround or resolution for this issue.

Best,

Mayukh Chakrabarti (he/him)
COMPUTATIONAL SCIENTIST

From: Chakrabarti, Mayukh (NIH/NCI) [C] <mayukh.chakrabarti.nih.gov>
Date: Tuesday, October 31, 2023 at 11:28 AM
To: amber.ambermd.org <amber.ambermd.org>
Subject: Amber22/AmberTools23: Enabling of libtorch & cudnn libraries breaks pbsa binaries
Hello,

I am encountering a problem in which enabling LibTorch libraries with Amber22 and AmberTools23 causes segmentation fault errors (SIGSEGV) in the pbsa binaries upon running the serial tests (make test.serial). I have successfully compiled a version of Amber22/AmberTools23 in which this library is not enabled, and none of the pbsa tests in AmberTools break (i.e., no segmentation fault errors).

Further details:

I am running my compilation with CUDA 10.2, gcc 8.5, OpenMPI 4.1.5, python 3.6, and cmake 3.25.1 on a Red Hat Enterprise Linux release 8.8 (Ootpa) system. As per the manual, I have tried both “Built-in” mode and “User-installed” mode, both resulting in the same errors. For the “User-installed” mode, I manually downloaded and extracted “libtorch-shared-with-deps-1.9.1+cu102.zip” from the PyTorch website to correspond to CUDA 10.2, and “cudnn-linux-x86_64-8.7.0.84_cuda10-archive.tar.xz” directly from the NVIDIA website, and specified the following variables to CMAKE:


    -DLIBTORCH=ON \
    -DTORCH_HOME=/path_to_libtorch \
    -DCUDNN=TRUE \
    -DCAFFE2_USE_CUDNN=1 \
    -DCUDNN_INCLUDE_PATH=/path_to_cudnn_include \
    -DCUDNN_LIBRARY_PATH=/path_to_libcudnn.so \


Building and compiling proceed without issue. However, when running the serial tests, I get errors akin to the following (example shown after sourcing amber.sh and running AmberTools pbsa_ligand test):

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x7f480cb65171 in ???
#1 0x7f480cb64313 in ???
#2 0x7f480c221c0f in ???
#3 0x7f4869667219 in ???
#4 0x7f480c224856 in ???
#5 0x7f4869648722 in ???
./Run.t4bnz.min: line 34: 2436022 Segmentation fault (core dumped) $DO_PARALLEL $TESTpbsa -O -i min.in -o $output < /dev/null
  ./Run.t4bnz.min: Program error

When running the exact same test with the version of Amber22 not containing the LibTorch libraries:

diffing mdout.lig.min.save with mdout.lig.min
PASSED
==============================================================

I have run ‘ldd’ on the pbsa binary to try to identify any libraries that may be missing, but there don’t appear to be any missing libraries. Could anybody please provide any insight into how to fix this issue?

Best,

Mayukh Chakrabarti
COMPUTATIONAL SCIENTIST

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue May 14 2024 - 16:00:03 PDT