AMBER: sander.MPI / openmpi on PBS

From: Arturas Ziemys <arturas.ziemys.uth.tmc.edu>
Date: Wed, 11 Jun 2008 13:59:52 -0500

Hi,

I have compiled AMBER 9 with Open MPI. The sander.MPI tests run well
(all passed). When I run it the way the tests do, i.e. directly from the
shell with mpiexec -np x ...., it runs well. But we have a batch system,
and if I run through PBS, I get errors. If I run a batch job like
'mpiexec -np x ...', sander.MPI runs on just a single CPU. Other
attempts produce errors like:

[Morpheus06:02155] *** Process received signal ***
[Morpheus06:02155] Signal: Segmentation fault (11)
[Morpheus06:02155] Signal code: Address not mapped (1)
[Morpheus06:02155] Failing at address: 0x39000000
[Morpheus06:02155] [ 0] /lib/tls/libpthread.so.0 [0x401ad610]
[Morpheus06:02155] [ 1] /lib/tls/libc.so.6 [0x420eb85e]
[Morpheus06:02155] [ 2] /lib/tls/libc.so.6(__cxa_finalize+0x7e) [0x42029eae]
[Morpheus06:02155] [ 3] /home/aziemys/bin/openmpi/lib/libmpi_f90.so.0 [0x40018325]
[Morpheus06:02155] [ 4] /home/aziemys/bin/openmpi/lib/libmpi_f90.so.0 [0x400190f6]
[Morpheus06:02155] [ 5] /lib/ld-linux.so.2 [0x4000c894]
[Morpheus06:02155] [ 6] /lib/tls/libc.so.6(exit+0x70) [0x42029c20]
[Morpheus06:02155] [ 7] /home/aziemys/bin/amber9/exe/sander.MPI [0x82beb63]
[Morpheus06:02155] [ 8] /home/aziemys/bin/amber9/exe/sander.MPI(_g95_exit_4+0x2c) [0x82bd648]
[Morpheus06:02155] [ 9] /home/aziemys/bin/amber9/exe/sander.MPI(mexit_+0x9f) [0x817cd03]
[Morpheus06:02155] [10] /home/aziemys/bin/amber9/exe/sander.MPI(MAIN_+0x3639) [0x80e8e51]
[Morpheus06:02155] [11] /home/aziemys/bin/amber9/exe/sander.MPI(main+0x2d) [0x82bb471]
[Morpheus06:02155] [12] /lib/tls/libc.so.6(__libc_start_main+0xe4) [0x42015574]
[Morpheus06:02155] [13] /home/aziemys/bin/amber9/exe/sander.MPI(sinh+0x49) [0x80697a1]
[Morpheus06:02155] *** End of error message ***
mpiexec noticed that job rank 0 with PID 2150 on node Morpheus06 exited on signal 11 (Segmentation fault).
5 additional processes aborted (not shown)

If I supply $PBS_NODEFILE to mpiexec with the option -machinefile or
--hostfile, I get:

Host key verification failed.
Host key verification failed.
[Morpheus06:02107] ERROR: A daemon on node Morpheus09 failed to start as expected.
[Morpheus06:02107] ERROR: There may be more information available from
[Morpheus06:02107] ERROR: the remote shell (see above).
[Morpheus06:02107] ERROR: The daemon exited unexpectedly with status 255.
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[Morpheus06:02107] ERROR: A daemon on node Morpheus07 failed to start as expected.
[Morpheus06:02107] ERROR: There may be more information available from
[Morpheus06:02107] ERROR: the remote shell (see above).
[Morpheus06:02107] ERROR: The daemon exited unexpectedly with status 255.
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1198
--------------------------------------------------------------------------
mpiexec was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------


Could anyone give me a clue about what to check? Open MPI? The cluster
setup?
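
A minimal sketch of checks that could be run from inside a PBS job to
narrow this down. It rests on two assumptions: that "Host key
verification failed" comes from ssh being used to start the remote
daemons, and that an Open MPI build with PBS (tm) support would list tm
components in ompi_info. The resource request and the helper names
count_slots and unique_hosts are hypothetical, chosen only for
illustration.

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=2
# The resource request above is a placeholder; adjust it to the cluster.

# Number of slots PBS allocated: the node file has one line per slot.
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# Distinct host names listed in the node file.
unique_hosts() {
    sort -u "$1"
}

# PBS sets PBS_NODEFILE inside a job; outside a job, use an empty file.
NODEFILE="${PBS_NODEFILE:-/dev/null}"
echo "slots allocated: $(count_slots "$NODEFILE")"

# List tm (PBS/Torque) components, if this Open MPI build has any.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep ': tm ' || echo "no tm components listed"
fi

# Try a non-interactive ssh login to every allocated node.
for host in $(unique_hosts "$NODEFILE"); do
    if ssh -n -o BatchMode=yes "$host" true; then
        echo "$host: ssh ok"
    else
        echo "$host: ssh FAILED"
    fi
done
```

If no tm components are listed, mpiexec probably falls back to rsh/ssh
to reach the other nodes, so every host in the node file would need to
accept a non-interactive login with a known host key.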

-- 
Arturas Ziemys, PhD
  School of Health Information Sciences
  University of Texas Health Science Center at Houston
  7000 Fannin, Suite 880
  Houston, TX 77030
  Phone: (713) 500-3975
  Fax:   (713) 500-3929  
Received on Sun Jun 15 2008 - 06:07:18 PDT