[AMBER] Running AmberTools21 on HPC cluster using distributed memory

From: Manuel Fernandez Merino <manuel.fernandez.crg.eu>
Date: Wed, 12 Jan 2022 09:28:56 +0000

Dear Amber community,


I have been dealing with problems with the installation of AmberTools21 on my research institution's HPC cluster for some time now. I first installed the serial version and ran the tests, and they seemed okay (except for the tests of the nab module, but since I do not plan on using nab, I decided to let that go).
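
For reference, I ran the serial tests in the standard way, roughly:

   cd /software/pcosma/el7.2/amber20    # $AMBERHOME on our cluster
   source amber.sh
   make test.serial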


Then I went on to install the parallel version of AmberTools. The installation seemed fine, and so did the tests when I launched them on the HPC cluster. To begin with, I just used the SMP environment, which provides shared memory (i.e., only the cores within a single node are used; on our cluster I can use up to 16 cores this way). These tests also seemed fine (again, except for nab).
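
For context, the shared-memory test jobs were submitted with a script along these lines (our scheduler uses Grid Engine-style parallel environments; the PE name "smp" and the header details are illustrative, not necessarily the exact ones on our cluster):

   #!/bin/bash
   #$ -cwd
   #$ -pe smp 16                         # shared memory: cores on a single node
   source /software/pcosma/el7.2/amber20/amber.sh
   export DO_PARALLEL="mpirun -np 2"     # the tests below were run with 2 ranks
   cd $AMBERHOME
   make test.parallel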


Finally, I decided to try distributed memory with the OpenMPI environment (the one our cluster supports), so that I can use as many cores across different nodes as I need. This is where I started to run into many errors.
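
The distributed jobs differ only in the parallel environment request, roughly as follows ("ompi" is a placeholder for whatever our cluster actually calls its OpenMPI environment):

   #$ -pe ompi 32                        # slots may be spread across several nodes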


First, in the error log file from the HPC cluster I get the following message:


Error: LD_LIBRARY_PATH does not include $AMBERHOME/lib!

Amber now requires $AMBERHOME/lib to be added to your LD_LIBRARY_PATH
environment variable in order for all components to work.

We recommend adding the line:

   test -f /software/pcosma/el7.2/amber20///amber.sh && source /software/pcosma/el7.2/amber20///amber.sh (sh/bash/zsh)
or
   test -f /software/pcosma/el7.2/amber20///amber.csh && source /software/pcosma/el7.2/amber20///amber.csh (csh/tcsh)

to your login shell resource file (e.g., ~/.bashrc or ~/.cshrc).

make[1]: *** [Makefile:14: test.parallel] Error 1
make: [Makefile:30: test.parallel] Error 2 (ignored)


I have added the test-and-source lines to my .bashrc, as well as the line:


export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$AMBERHOME/lib

This was an attempt to solve the LD_LIBRARY_PATH issue. When I launch any job on the cluster, I also get a notification that the LD_LIBRARY_PATH variable is not passed into the job environment for security reasons, which I believe may be related to this problem.
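
If it is relevant, my understanding from the Open MPI documentation is that, because the scheduler strips LD_LIBRARY_PATH for security, the variable would have to be set inside the job script itself and re-exported to the remote ranks with mpirun's -x option. A sketch of what I mean (so far I have only applied the .bashrc change):

   # inside the job script, not only in ~/.bashrc
   source /software/pcosma/el7.2/amber20/amber.sh
   export LD_LIBRARY_PATH=$AMBERHOME/lib:$LD_LIBRARY_PATH
   # -x re-exports an environment variable to the remote MPI processes
   export DO_PARALLEL="mpirun -np 2 -x LD_LIBRARY_PATH"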


When I look at the log file from the tests themselves, things get even messier. I'm including part of it here:



(cd AmberTools/test && make test.parallel)
make[1]: Entering directory '/nfs/software/pcosma/el7.2/amber20/AmberTools/test'
./test_at_parallel.sh

Tests being run with DO_PARALLEL="mpirun -np 2".

make[1]: Leaving directory '/nfs/software/pcosma/el7.2/amber20/AmberTools/test'
(cd AmberTools/test && make test.parallel)
make[1]: Entering directory '/nfs/software/pcosma/el7.2/amber20/AmberTools/test'
./test_at_parallel.sh

Tests being run with DO_PARALLEL="mpirun -np 2".

make[2]: Entering directory '/nfs/software/pcosma/el7.2/amber20/AmberTools/test'
./test_at_clean.sh

[... I'm cutting this part out because I do not believe it is crucial]

cd nab && make -k test testrism
make[3]: Entering directory '/nfs/software/pcosma/el7.2/amber20/AmberTools/test/nab'
Running test to do simple minimization
(this tests the molecular mechanics interface)

Error executing ucpp: No such file or directory
Preprocessor failed with exit code 256
./Run.sff: Program error
make[3]: *** [Makefile:46: sff_test] Error 1
Running test to do simple minimization with shake
(this tests the molecular mechanics interface)

Error executing ucpp: No such file or directory
Preprocessor failed with exit code 256
./Run.shake: Program error
make[3]: *** [Makefile:49: rattle_min_test] Error 1
Running test to do simple minimization
(this tests the generalized Born implementation)

[... Here, more nab errors of this kind appear. They were also present when running the tests in the serial and SMP parallel versions]

Error executing ucpp: No such file or directory
Preprocessor failed with exit code 256
./Run.rism_mdiis0: Program error
make[3]: *** [Makefile:178: rism_mdiis0] Error 1
Running test to do basic MD (librism)
(this tests the 3D-RISM-KH implementation)

Error executing ucpp: No such file or directory
Preprocessor failed with exit code 256
./Run.rism_mdiis1: Program error
make[3]: *** [Makefile:181: rism_mdiis1] Error 1
Running test to do basic MD (librism)
(trajectory processing using the 3D-RISM command line interface)

mpirun -np 2 /software/pcosma/el7.2/amber20///bin/rism3d.snglpnt.MPI


**********************************************************

mpirun does not support recursive calls

**********************************************************
./Run.rism_sp: Program error
make[3]: *** [Makefile:169: rism_sp] Error 1
Running test for 3D-RISM closure list (librism)
(trajectory processing test)

mpirun -np 2 /software/pcosma/el7.2/amber20///bin/rism3d.snglpnt.MPI

**********************************************************

mpirun does not support recursive calls

**********************************************************
./Run.rism_sp_list: Program error
make[3]: *** [Makefile:175: rism_sp_list] Error 1
Running test to do simple minimization (librism)
(this tests the 3D-RISM implementation)

Error executing ucpp: No such file or directory
Preprocessor failed with exit code 256
./Run.rism_xmin: Program error
make[3]: *** [Makefile:184: rism_xmin] Error 1
Running test to do basic MD (librism)
(trajectory processing using the 3D-RISM command line interface)

mpirun -np 2 /software/pcosma/el7.2/amber20///bin/rism3d.snglpnt.MPI


**********************************************************

mpirun does not support recursive calls

**********************************************************
./Run.rism_selftest_kh: Program error
make[3]: *** [Makefile:187: rism_selftest_kh] Error 1
Running test to do basic MD (librism)
(trajectory processing using the 3D-RISM command line interface)

[Here, this same problem appears many times in many different tests]

mpirun -np 2 /software/pcosma/el7.2/amber20///bin/rism3d.snglpnt.MPI


**********************************************************

mpirun does not support recursive calls

**********************************************************
./Run.rism_sp_nacl_tree_fast: Program error
make[3]: *** [Makefile:244: rism_sp_nacl_tree_fast] Error 1
Running test to do basic MD (librism)
(trajectory processing using the 3D-RISM command line interface)

mpirun -np 2 /software/pcosma/el7.2/amber20///bin/rism3d.snglpnt.MPI


**********************************************************

mpirun does not support recursive calls

**********************************************************
./Run.rism_sp_astol_fast: Program error
make[3]: *** [Makefile:247: rism_sp_astol_fast] Error 1
make[3]: Target 'testrism' not remade because of errors.
make[3]: Leaving directory '/nfs/software/pcosma/el7.2/amber20/AmberTools/test/nab'
make[2]: *** [Makefile:145: test.nab] Error 2
cd ../src/cpptraj/test && make -k test
make[3]: Entering directory '/nfs/software/pcosma/el7.2/amber20/AmberTools/src/cpptraj/test'
make test.complete summary
make[4]: Entering directory '/nfs/software/pcosma/el7.2/amber20/AmberTools/src/cpptraj/test'
[node-hp0511.linux.crg.es:14393] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168



After this line, the testing just gets stuck. I believe this last line points to a problem in the MPI implementation, and I have tried some quick fixes I found on the internet, such as exporting PMIX_MCA_gds=^ds12 or PMIX_MCA_gds=hash in .bashrc, as shown below. Does anybody with experience running Amber on an HPC cluster know what is going on and how to fix it? I would be very thankful for any advice.
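
The lines I added to .bashrc look like this (the commented-out DO_PARALLEL line is only my guess at forwarding the variable to the remote ranks with Open MPI's -x option, in case .bashrc is not sourced there; I have not confirmed it is needed):

   # force the hash gds component instead of ds12
   export PMIX_MCA_gds=hash
   # or, alternatively, exclude the ds12 component:
   # export PMIX_MCA_gds=^ds12
   # possibly also required so the remote ranks see the setting:
   # export DO_PARALLEL="mpirun -np 2 -x PMIX_MCA_gds"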


Kind regards and many thanks,


Manuel F. Merino

PhD Candidate

Centre for Genomic Regulation

Barcelona, Spain
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber