[AMBER] Open MPI failure

From: David Winogradoff <dwino218.gmail.com>
Date: Mon, 6 May 2013 12:18:43 -0400

I am running REMD with Amber 12 on several different systems. Using Open
MPI and the executable sander.MPI, I successfully ran many consecutive
jobs on a local supercomputer, but then I reached a point where jobs no
longer ran because of Open MPI failures. Pasted below is the most relevant
portion of a recent error message from one of the failed jobs. Any insight
into why such an error might occur, or how the problem could be fixed,
would be greatly appreciated.

"...
[compute-f15-25.deepthought.umd.edu:16922] mca: base: component_find: unable to open /cell_root/software/openmpi/1.4.3/gnu/sys/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 176 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[compute-f15-25.deepthought.umd.edu:16915] [0] func:/cell_root/software/openmpi/1.4.3/gnu/sys/lib/libopen-pal.so.0(opal_backtrace_buffer+0x2e) [0x2b086e233c7e]
[compute-f15-25.deepthought.umd.edu:16915] [1] func:/cell_root/software/openmpi/1.4.3/gnu/sys/lib/libmpi.so.0(ompi_mpi_abort+0x2a6) [0x2b086dd48216]
[compute-f15-25.deepthought.umd.edu:16915] [2] func:/cell_root/software/openmpi/1.4.3/gnu/sys/lib/libmpi_f77.so.0(MPI_ABORT+0x25) [0x2b086db032b5]
[compute-f15-25.deepthought.umd.edu:16915] [3] func:/homes/dwinogra/amber12/bin/sander.MPI(mexit_+0x45) [0x59fea9]
[compute-f15-25.deepthought.umd.edu:16915] [4] func:/homes/dwinogra/amber12/bin/sander.MPI(load_ewald_info_+0x3cd) [0x54520d]
[compute-f15-25.deepthought.umd.edu:16915] [5] func:/homes/dwinogra/amber12/bin/sander.MPI(mdread1_+0x4450) [0x4f04c5]
[compute-f15-25.deepthought.umd.edu:16915] [6] func:/homes/dwinogra/amber12/bin/sander.MPI(sander_+0x28a) [0x4c3644]
[compute-f15-25.deepthought.umd.edu:16915] [7] func:/homes/dwinogra/amber12/bin/sander.MPI [0x4c25de]
[compute-f15-25.deepthought.umd.edu:16915] [8] func:/homes/dwinogra/amber12/bin/sander.MPI(main+0x34) [0x4c269e]
[compute-f15-25.deepthought.umd.edu:16915] [9] func:/lib64/libc.so.6(__libc_start_main+0xf4) [0x2b086f6ba994]
[compute-f15-25.deepthought.umd.edu:16915] [10] func:/homes/dwinogra/amber12/bin/sander.MPI [0x44d1d9]
[compute-f15-22.deepthought.umd.edu:27779] [0] func:/cell_root/software/openmpi/1.4.3/gnu/sys/lib/libopen-pal.so.0(opal_backtrace_buffer+0x2e) [0x2b48806abc7e]
[compute-f15-22.deepthought.umd.edu:27779] [1] func:/cell_root/software/openmpi/1.4.3/gnu/sys/lib/libmpi.so.0(ompi_mpi_abort+0x2a6) [0x2b48801c0216]
[compute-f15-22.deepthought.umd.edu:27779] [2] func:/cell_root/software/openmpi/1.4.3/gnu/sys/lib/libmpi_f77.so.0(MPI_ABORT+0x25) [0x2b487ff7b2b5]
[compute-f15-22.deepthought.umd.edu:27779] [3] func:/homes/dwinogra/amber12/bin/sander.MPI(mexit_+0x45) [0x59fea9]
[compute-f15-22.deepthought.umd.edu:27779] [4] func:/homes/dwinogra/amber12/bin/sander.MPI(load_ewald_info_+0x3cd) [0x54520d]
[compute-f15-22.deepthought.umd.edu:27779] [5] func:/homes/dwinogra/amber12/bin/sander.MPI(mdread1_+0x4450) [0x4f04c5]
[compute-f15-22.deepthought.umd.edu:27779] [6] func:/homes/dwinogra/amber12/bin/sander.MPI(sander_+0x28a) [0x4c3644]
[compute-f15-22.deepthought.umd.edu:27779] [7] func:/homes/dwinogra/amber12/bin/sander.MPI [0x4c25de]
[compute-f15-22.deepthought.umd.edu:27779] [8] func:/homes/dwinogra/amber12/bin/sander.MPI(main+0x34) [0x4c269e]
[compute-f15-22.deepthought.umd.edu:27779] [9] func:/lib64/libc.so.6(__libc_start_main+0xf4) [0x2b4881b32994]
[compute-f15-22.deepthought.umd.edu:27779] [10] func:/homes/dwinogra/amber12/bin/sander.MPI [0x44d1d9]
--------------------------------------------------------------------------
mpirun has exited due to process rank 192 with PID 16915 on
node compute-f15-25.deepthought.umd.edu exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[compute-f16-1.deepthought.umd.edu:12772] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[compute-f16-1.deepthought.umd.edu:12772] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages..."


Thanks,
David Winogradoff
~~~~~~~~~~~~~~~~~~
PhD Student
Chemical Physics
University of Maryland
College Park
~~~~~~~~~~~~~~~~~~
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber