[AMBER] Amber 14 test errors

From: Jordi Bujons <jordi.bujons.iqac.csic.es>
Date: Thu, 28 Aug 2014 20:23:27 +0200

Hello Amber users,

 

I have installed Amber 14 and AmberTools 14 on a server and a workstation
running SUSE SLES 11-SP1 (kernel 2.6.32.59-0.7, gcc 4.3.4, Open MPI 1.2.8)
and OpenSuse 12.3 (kernel 3.7.10-1.16, gcc 4.7.2, Open MPI 1.6),
respectively. The serial and parallel versions of Amber, with all updates
applied, installed without problems on both machines, following the
directions in the manual and in Jason Swails' wiki and using the GNU
compilers with Open MPI.
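
(For reference, the build on both machines was essentially the standard
sequence from the Amber 14 manual; a sketch, with $AMBERHOME pointing at
/Programs/amber14:)

  # Serial build; configure also offers to download and apply updates
  cd $AMBERHOME
  ./configure gnu
  source amber.sh
  make install

  # Parallel build, on top of the finished serial one
  ./configure -mpi gnu
  make install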

When running the tests on both machines, everything looked OK on the
workstation, but I got some warning/error messages on the server. These
seemed to be related to the MPI uDAPL libraries (libdaplscm.so.1,
libdaplcma.so.1, ...), and I got a whole bunch of warnings similar to:

 

:
:
--------------------------------------------------------------------------
WARNING: Failed to open "OpenIB-cma"
[DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: Failed to open "OpenIB-cma-1"
[DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--------------------------------------------------------------------------
:
:
--------------------------------------------------------------------------
DAT: library load failure: /usr/lib64/libdaplcma.so.1: undefined symbol:
dat_registry_add_provider
DAT: library load failure: /usr/lib64/libdaplscm.so.1: undefined symbol:
dat_registry_add_provider
(the libdaplscm.so.1 and libdaplscm.so.2 failures repeat several times)
--------------------------------------------------------------------------
:
:

After some Google searching, I found that adding the following line to the
bottom of /etc/openmpi-mca-params.conf could get rid of those messages:

btl = ^udapl

Indeed, after recompiling and re-running the tests on the server, I got only
a few errors, on which I would like some advice if possible.
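
The same parameter can also be passed per run instead of system-wide; a
small sketch (standard Open MPI syntax, with the usual Amber test targets)
of how the parallel tests can be run with the udapl BTL disabled:

  # Per-run alternative to editing /etc/openmpi-mca-params.conf:
  # exclude the udapl BTL on the mpirun command line.
  export DO_PARALLEL='mpirun --mca btl ^udapl -np 2'
  cd $AMBERHOME && make test.parallel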

- In the test_at_serial section, 5 file comparisons failed:

$ cat test_at_serial/at_summary
1265 file comparisons passed
5 file comparisons failed
0 tests experienced errors
Test log file saved as /Programs/amber14/logs/test_at_serial/2014-08-27_11-59-41.log
Test diffs file saved as /Programs/amber14/logs/test_at_serial/2014-08-27_11-59-41.diff
 

$ cat test_at_serial/2014-08-27_11-59-41.diff
possible FAILURE:  check prmtop.new_type.dif
/Programs/amber14/AmberTools/test/parmed/chamber_prmtop
5025,5028c5025,5028
<   7.0320002812773935E+4  3.0991959086075940E+3  1.5730411891415534E+6
<   1.2254480396417496E+6  1.5350319889737577E+6  1.8962122985815424E+6
<   1.6150318076691339E+6  5.5913203238501586E+5  5.5913203238501586E+5
<   7.5015433968630561E+5  1.2254480396417496E+6  1.5350319889737577E+6
---
>   7.0320002812773819E+4  3.0991959086075940E+3  1.5730411891415534E+6
>   1.2254480396417496E+6  1.5350319889737575E+6  1.8962122985815424E+6
>   1.6150318076691339E+6  5.5913203238501656E+5  5.5913203238501656E+5
>   7.5015433968630666E+5  1.2254480396417496E+6  1.5350319889737575E+6
:
:
:
---------------------------------------
possible FAILURE:  check prmtop.NBFIX.dif
/Programs/amber14/AmberTools/test/parmed/chamber_prmtop
5025,5028c5025,5028
<   7.0320002812773935E+4  3.0991959086075940E+3  1.5730411891415534E+6
<   1.2254480396417496E+6  1.5350319889737577E+6  1.8962122985815424E+6
<   1.6150318076691339E+6  5.5913203238501586E+5  5.5913203238501586E+5
<   7.5015433968630561E+5  1.2254480396417496E+6  1.5350319889737577E+6
---
>   7.0320002812773819E+4  3.0991959086075940E+3  1.5730411891415534E+6
>   1.2254480396417496E+6  1.5350319889737575E+6  1.8962122985815424E+6
>   1.6150318076691339E+6  5.5913203238501656E+5  5.5913203238501656E+5
>   7.5015433968630666E+5  1.2254480396417496E+6  1.5350319889737575E+6
:
:
:
---------------------------------------
possible FAILURE:  check prmtop.new_chg.dif
/Programs/amber14/AmberTools/test/parmed/chamber_prmtop
5025,5028c5025,5028
<   7.0320002812773935E+4  3.0991959086075940E+3  1.5730411891415534E+6
<   1.2254480396417496E+6  1.5350319889737577E+6  1.8962122985815424E+6
<   1.6150318076691339E+6  5.5913203238501586E+5  5.5913203238501586E+5
<   7.5015433968630561E+5  1.2254480396417496E+6  1.5350319889737577E+6
---
>   7.0320002812773819E+4  3.0991959086075940E+3  1.5730411891415534E+6
>   1.2254480396417496E+6  1.5350319889737575E+6  1.8962122985815424E+6
>   1.6150318076691339E+6  5.5913203238501656E+5  5.5913203238501656E+5
>   7.5015433968630666E+5  1.2254480396417496E+6  1.5350319889737575E+6
:
:
:
---------------------------------------
possible FAILURE:  check final.prmtop.dif
/Programs/amber14/AmberTools/test/parmed/chamber_prmtop
5025,5028c5025,5028
<   7.0320002812773935E+4  3.0991959086075940E+3  1.5730411891415534E+6
<   1.2254480396417496E+6  1.5350319889737577E+6  1.8962122985815424E+6
<   1.6150318076691339E+6  5.5913203238501586E+5  5.5913203238501586E+5
<   7.5015433968630561E+5  1.2254480396417496E+6  1.5350319889737577E+6
---
>   7.0320002812773819E+4  3.0991959086075940E+3  1.5730411891415534E+6
>   1.2254480396417496E+6  1.5350319889737575E+6  1.8962122985815424E+6
>   1.6150318076691339E+6  5.5913203238501656E+5  5.5913203238501656E+5
>   7.5015433968630666E+5  1.2254480396417496E+6  1.5350319889737575E+6
:
:
:
---------------------------------------
possible FAILURE:  check mdout.pbsa_decres.dif
/Programs/amber14/test/sander_pbsa_decres
1320c1320
<  Total surface charge      -5.9357
>  Total surface charge      -5.9270
1321c1321
<  Reaction field energy  -2007.5763
>  Reaction field energy  -1609.3322
1322c1322
<  Cavity solvation energy     38.8528
>  Cavity solvation energy     39.1489
1324c1324
<       1      -3.0345E+3     1.6433E+1     8.6691E+1     C        1030
>       1      -2.6359E+3     1.6488E+1     8.6861E+1     C        1030
1326c1326
<  VDWAALS =     -546.2276  EEL     =    -4637.2832  EPB        =     -2007.5763
>  VDWAALS =     -546.2276  EEL     =    -4637.2832  EPB        =     -1609.3322
:
:
:
1593c1593
< BDC     76     7.328    -1.602   -33.516   -44.142     0.
> BDC     76     7.328    -1.602   -33.516   -51.973     0.
### Maximum absolute error in matching lines = 3.99e+02 at line 1324 field 2
### Maximum relative error in matching lines = 6.96e+01 at line 1566 field 6
---------------------------------------
:
:
 
I guess that the first four possible failures could be just rounding errors,
although sometimes they affect the last three digits (e.g.
7.0320002812773935E+4 vs 7.0320002812773819E+4). But the sander_pbsa_decres
test shows much larger deviations. Are these a sign that something is really
not right?
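
For the parmed diffs, the size of the discrepancy is easy to put in
perspective; a quick sketch using the pair of values quoted above:

  # Relative difference between the two quoted values; a result near
  # 1e-15 is at the round-off level of double-precision arithmetic.
  awk 'BEGIN { a = 7.0320002812773935e4; b = 7.0320002812773819e4;
               printf "relative diff = %.1e\n", (a - b) / a }'
  # prints: relative diff = 1.6e-15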
 
 
- In the test_amber_parallel section, I got no errors with
"DO_PARALLEL='mpirun -np 2'", but one error with -np 8:
 
$ cat test_amber_parallel/2014-08-28_11-40-33.log
:
:
:
==============================================================
export TESTsander='../../bin/pmemd.MPI' && cd multid_remd && ./Run.multirem
 
Running multipmemd version of pmemd Amber12
    Total processors =     8
    Number of groups =     4
 
[mordor9:08289] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1
mpirun noticed that job rank 1 with PID 8290 on node mordor9 exited on
signal 15 (Terminated).
6 additional processes aborted (not shown)
./Run.multirem: Program error
make[2]: [test.pmemd.REM] Error 1 (ignored)
export TESTsander='../../bin/pmemd.MPI'; cd rxsgld_4rep && ./Run.rxsgld
 
Running multipmemd version of pmemd Amber12
    Total processors =     8
    Number of groups =     4
 
diffing rxsgld.log.save with rxsgld.log
PASSED
==============================================================
:
:
make[2]: Leaving directory `/Programs/amber14/test'
166 file comparisons passed
0 file comparisons failed
1 tests experienced an error
Test log file saved as
/Programs/amber14/logs/test_amber_parallel/2014-08-28_11-40-33.log
No test diffs to save!
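
Since MPI_ABORT hides the underlying error in the test log, the failing REMD
test can be rerun by itself to see the actual message; a sketch, assuming
the test layout shown in the log:

  # Rerun the failing multi-dimensional REMD test in isolation; the
  # output files left in the test directory show the real error.
  cd $AMBERHOME/test/multid_remd
  export DO_PARALLEL='mpirun -np 8'
  ./Run.multirem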
 
 
- Finally, in the test_at_parallel section, I got one error with
"DO_PARALLEL='mpirun -np 2'":
 
$ cat 2014-08-28_10-19-35.log
:
:
:
==============================================================
export TESTsander=/Programs/amber14/bin/sander.MPI; cd cnstph_remd/TempRem
&& ./Run.cnstph_remd
 
Running multisander version of sander Amber14
    Total processors =     2
    Number of groups =     2
 
[mordor9:02807] *** Process received signal ***
[mordor9:02807] Signal: Segmentation fault (11)
[mordor9:02807] Signal code: Address not mapped (1)
[mordor9:02807] Failing at address: 0x100
[mordor9:02807] [ 0] /lib64/libpthread.so.0(+0xf6b0) [0x2b3e4188c6b0]
[mordor9:02807] [ 1] /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_pml_ob1.so(+0x74d0) [0x2b3e462fe4d0]
[mordor9:02807] [ 2] /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x670) [0x2b3e46d8eed0]
[mordor9:02807] [ 3] /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x2b) [0x2b3e4650b1fb]
[mordor9:02807] [ 4] /usr/lib64/mpi/gcc/openmpi/lib64/libopen-pal.so.0(opal_progress+0x4a) [0x2b3e408c2fba]
[mordor9:02807] [ 5] /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_wait+0x1d) [0x2b3e42c5998d]
[mordor9:02807] [ 6] /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv+0x437) [0x2b3e42c5d0a7]
[mordor9:02807] [ 7] /usr/lib64/mpi/gcc/openmpi/lib64/libopen-rte.so.0(mca_oob_recv_packed+0x33) [0x2b3e40685df3]
[mordor9:02807] [ 8] /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_gpr_proxy.so(orte_gpr_proxy_increment_value+0x1e2) [0x2b3e43068712]
[mordor9:02807] [ 9] /usr/lib64/mpi/gcc/openmpi/lib64/libopen-rte.so.0(orte_smr_base_set_proc_state+0x2ac) [0x2b3e4069c9ac]
[mordor9:02807] [10] /usr/lib64/mpi/gcc/openmpi/lib64/libmpi.so.0(ompi_mpi_finalize+0x111) [0x2b3e403f5571]
[mordor9:02807] [11] /usr/lib64/mpi/gcc/openmpi/lib64/libmpi_f77.so.0(MPI_FINALIZE+0x9) [0x2b3e401ae549]
[mordor9:02807] [12] /Programs/amber14/bin/sander.MPI(mexit_+0x4f) [0x5d723f]
[mordor9:02807] [13] /Programs/amber14/bin/sander.MPI(MAIN__+0xd21) [0x4f17d5]
[mordor9:02807] [14] /Programs/amber14/bin/sander.MPI(main+0x2c) [0xc02d1c]
[mordor9:02807] [15] /lib64/libc.so.6(__libc_start_main+0xe6) [0x2b3e41ab8bc6]
[mordor9:02807] [16] /Programs/amber14/bin/sander.MPI() [0x46f9e9]
[mordor9:02807] *** End of error message ***
:
:
:
Finished test suite for AmberTools at Thu Aug 28 11:00:14 CEST 2014.
 
make[2]: Leaving directory `/Programs/amber14/AmberTools/test'
541 file comparisons passed
0 file comparisons failed
1 tests experienced errors
Test log file saved as
/Programs/amber14/logs/test_at_parallel/2014-08-28_10-19-35.log
No test diffs to save!
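
The backtrace dies inside Open MPI's shared-memory BTL during MPI_FINALIZE,
so one experiment is to rerun just that test with the sm transport disabled;
a sketch, assuming the test lives under $AMBERHOME/test as the log suggests:

  # Rerun only the constant-pH REMD test, forcing Open MPI to use
  # TCP instead of the shared-memory (sm) BTL.
  cd $AMBERHOME/test/cnstph_remd/TempRem
  export DO_PARALLEL='mpirun --mca btl self,tcp -np 2'
  ./Run.cnstph_remd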
 
 
 
...and when using "DO_PARALLEL='mpirun -np 8'", 3 file comparisons failed
and 4 tests experienced errors:
 
 
 
$ cat 2014-08-28_11-15-54.log
:
:
:
==============================================================
export TESTsander=/Programs/amber14/bin/sander.MPI; cd multid_remd &&
./Run.multirem
 
Running multisander version of sander Amber14
    Total processors =     8
    Number of groups =     4
 
[mordor9:24613] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1
mpirun noticed that job rank 1 with PID 24614 on node mordor9 exited on
signal 15 (Terminated).
6 additional processes aborted (not shown)
./Run.multirem: Program error
make[2]: [test.sander.REM] Error 1 (ignored)
export TESTsander=/Programs/amber14/bin/sander.MPI; cd sodium &&
./Run.sodium
 
Running multisander version of sander Amber14
    Total processors =     8
    Number of groups =     2
 
diffing md1.o.save with md1.o
PASSED
==============================================================
:
:
==============================================================
cd qmmm2/xcrd_build_test/ && ./Run.ortho_qmewald0
 
* NB pairs          145      185645 exceeds capacity (      185750)   3
     SIZE OF NONBOND LIST =     185750
SANDER BOMB in subroutine nonbond_list
Non bond list overflow!
check MAXPR in locmem.f
[mordor9:28926] MPI_ABORT invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 1
mpirun noticed that job rank 0 with PID 28923 on node mordor9 exited on
signal 15 (Terminated).
6 additional processes aborted (not shown)
  ./Run.ortho_qmewald0:  Program error
make[3]: [test.sander.QMMM] Error 1 (ignored)
cd qmmm2/xcrd_build_test/ && ./Run.truncoct_qmewald0
 
* NB pairs          174      185738 exceeds capacity (      185750)   3
    SIZE OF NONBOND LIST =     185750
SANDER BOMB in subroutine nonbond_list
Non bond list overflow!
check MAXPR in locmem.f
[mordor9:28941] MPI_ABORT invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 1
mpirun noticed that job rank 0 with PID 28938 on node mordor9 exited on
signal 15 (Terminated).
6 additional processes aborted (not shown)
  ./Run.truncoct_qmewald0:  Program error
make[3]: [test.sander.QMMM] Error 1 (ignored)
cd qmmm2/crambin_2 && ./Run.crambin
diffing crambin.out.save with crambin.out
PASSED
==============================================================
:
:
==============================================================
export TESTsander=/Programs/amber14/bin/sander.MPI && cd
qmmm2/adqmmm_h2o-box && ./Run.adqmmm-fixedR-calc_wbk2
 
Running multisander version of sander Amber14
    Total processors =     8
    Number of groups =     4
 
[mordor9:32011] MPI_ABORT invoked on rank 6 in communicator MPI_COMM_WORLD
with errorcode 1
mpirun noticed that job rank 0 with PID 32005 on node mordor9 exited on
signal 15 (Terminated).
6 additional processes aborted (not shown)
  ./Run.adqmmm-fixedR-calc_wbk2:  Program error
make[2]: [test.sander.ADQMMM.MPI] Error 1 (ignored)
export TESTsander=/Programs/amber14/bin/sander.MPI; make -k
test.sander.ABFQMMM
make[3]: Entering directory `/Programs/amber14/test'
cd abfqmmm/abfqmmm_water_sp && ./Run.abfqmmm_water_sp
diffing abfqmmm_water_sp.out.save with abfqmmm_water_sp.out
PASSED
==============================================================
:
:
Finished test suite for AmberTools at Thu Aug 28 11:40:31 CEST 2014.
 
make[2]: Leaving directory `/Programs/amber14/AmberTools/test'
437 file comparisons passed
3 file comparisons failed
4 tests experienced errors
Test log file saved as
/Programs/amber14/logs/test_at_parallel/2014-08-28_11-15-54.log
Test diffs file saved as
/Programs/amber14/logs/test_at_parallel/2014-08-28_11-15-54.diff
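
Two of the four test errors are the qmmm2/xcrd_build_test nonbond-list
overflows shown above ("check MAXPR in locmem.f"), and they appear only at
8 ranks; an easy first check is whether the same tests pass with fewer
ranks, e.g.:

  # The nonbond list overflows only show up at -np 8; check whether
  # the same QM/MM tests still pass at a lower rank count.
  cd $AMBERHOME/test/qmmm2/xcrd_build_test
  export DO_PARALLEL='mpirun -np 4'
  ./Run.ortho_qmewald0 && ./Run.truncoct_qmewald0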
 
 
$ cat 2014-08-28_11-15-54.diff
possible FAILURE:  check rem.out.000.dif
/Programs/amber14/test/rem_hybrid
197c197
<  BOND   =         6.2782  ANGLE   =        23.4418  DIHED      =        28.9262
>  BOND   =         6.2783  ANGLE   =        23.4418  DIHED      =        28.9262
### Maximum absolute error in matching lines = 1.00e-04 at line 197 field 3
### Maximum relative error in matching lines = 1.59e-05 at line 197 field 3
---------------------------------------
possible FAILURE:  check mdout.emil.0.5.dif
/Programs/amber14/test/emil/emil_sander_tip3p
100a101,104
> *****  Processor      0
> ***** System must be very inhomogeneous.
> *****  Readjusting recip sizes.
>  In this slab, Atoms found:       310  Allocated:       309
---------------------------------------
possible FAILURE:  check mdout.emil.0.9.dif
/Programs/amber14/test/emil/emil_sander_tip3p
118a119,122
> *****  Processor      0
> ***** System must be very inhomogeneous.
> *****  Readjusting recip sizes.
>  In this slab, Atoms found:       310  Allocated:       309
 
 
 
I searched the Amber list archives but could not find what these
parallel-execution errors mean. Do they point to a problem during
compilation, a bug, or something else?
 
Any help or suggestions will be greatly appreciated.
 
Jordi Bujons
 
--------------------------------------------------------------------------------
Jordi Bujons, PhD
Dept. of Biological Chemistry and Molecular Modeling (QBMM)
Institute of Advanced Chemistry of Catalonia (IQAC)
National Research Council of Spain (CSIC)
Address: Jordi Girona 18-26, 08034 Barcelona, Spain
Phone: +34 934006100 ext. 1291
FAX: +34 932045904
jordi.bujons.iqac.csic.es
jbujons1.gmail.com
http://www.iqac.csic.es
--------------------------------------------------------------------------------
 
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber