I apologize for the previous message; Gmail decided to send it a little
prematurely.
We have been struggling with a particularly nasty issue with the Amber10 PIMD
tests. We recently acquired a new compute machine fitted with two 4-core,
hyperthreading-capable Nehalems (Xeon 5520s) and installed Ubuntu 9.10 on it.
The machine itself has shown no instability thus far. Despite this, the PIMD
tests invariably stall when more than 4 cores are used:
peker.polyphemos:/amber/amber10/test/PIMD/full_pimd_amoeba$ export DO_PARALLEL='mpirun -np 4'
peker.polyphemos:/amber/amber10/test/PIMD/full_pimd_amoeba$ time ./Run.full_pimd_amoeba
Testing PIMD with amoeba force field
diffing pimd_amoeba.out.save with pimd_amoeba.out
PASSED
==============================================================
real 0m2.771s
user 0m0.110s
sys 0m0.030s
peker.polyphemos:/amber/amber10/test/PIMD/full_pimd_amoeba$ export DO_PARALLEL='mpirun -np 8'
peker.polyphemos:/amber/amber10/test/PIMD/full_pimd_amoeba$ time ./Run.full_pimd_amoeba
^C
real 68m24.496s
user 0m0.210s
sys 0m0.010s
peker.polyphemos:/amber/amber10/test/PIMD/full_pimd_amoeba$ export DO_PARALLEL='mpirun -np 12'
peker.polyphemos:/amber/amber10/test/PIMD/full_pimd_amoeba$ time ./Run.full_pimd_amoeba
Testing PIMD with amoeba force field
real 5m38.135s
user 0m0.080s
sys 0m0.030s
peker.polyphemos:/amber/amber10/test/PIMD/full_pimd_amoeba$ export DO_PARALLEL='mpirun -np 16'
peker.polyphemos:/amber/amber10/test/PIMD/full_pimd_amoeba$ time ./Run.full_pimd_amoeba
Testing PIMD with amoeba force field
real 8m23.174s
user 0m0.060s
sys 0m0.040s
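The runs above were interrupted by hand. A minimal sketch of how the same sweep could be automated is given here, assuming GNU timeout (coreutils) is installed. The function name run_limited and the cutoff value are made up for illustration and are not part of Amber. Note that the exit status only says whether the test script returned, not whether the diff passed.

```shell
# run_limited LIMIT NP COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND with DO_PARALLEL set for NP processes and
# gives up after LIMIT seconds. LIMIT is an arbitrary cutoff chosen by the
# caller, not an Amber setting. The command's output is discarded here;
# redirect it to a file instead if the PASSED/FAILED lines are wanted.
run_limited() {
    limit=$1
    np=$2
    shift 2
    status=0
    # GNU timeout exits with status 124 when it had to kill the command.
    DO_PARALLEL="mpirun -np $np" timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    case $status in
        0)   echo "np=$np finished" ;;
        124) echo "np=$np STALL (no exit within ${limit}s)" ;;
        *)   echo "np=$np exited with status $status" ;;
    esac
}
```

From the test directory this might be used as

    for np in 4 8 12 16; do run_limited 600 "$np" ./Run.full_pimd_amoeba; done

Stray sander.MPI processes may survive the timeout and need to be killed separately.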
While in this stalled state, each of the sander.MPI processes is using nearly
100% of its core. This is reproducible with both OpenMPI-1.3.3 and
mpich2-1.2.1. In both cases, one can attach a debugger to one of the sander.MPI
processes to find that it is stuck in an MPI_Barrier,
peker.polyphemos:/usr/local/lib$ gdb ../../../exe/sander.MPI 32727
GNU gdb (GDB) 7.0-ubuntu
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
../../../exe/sander.MPI: No such file or directory.
Attaching to process 32727
Reading symbols from /amber/amber10/bin/sander.MPI...done.
Reading symbols from /usr/local/lib/libmpichf90.so.1.2...(no debugging symbols found)...done.
Loaded symbols for /usr/local/lib/libmpichf90.so.1.2
Reading symbols from /usr/local/lib/libmpich.so.1.2...done.
Loaded symbols for /usr/local/lib/libmpich.so.1.2
Reading symbols from /lib/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/librt.so.1
Reading symbols from /usr/lib/libgfortran.so.3...Reading symbols from /usr/lib/debug/usr/lib/libgfortran.so.3.0.0...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libgfortran.so.3
Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libgcc_s.so.1...Reading symbols from /usr/lib/debug/lib/libgcc_s.so.1...done.
(no debugging symbols found)...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x00007fe17d206e37 in sched_yield () from /lib/libc.so.6
(gdb) bt
#0 0x00007fe17d206e37 in sched_yield () from /lib/libc.so.6
#1 0x00007fe17e1011ea in MPID_nem_mpich2_blocking_recv (
progress_state=<value optimized out>, is_blocking=<value optimized out>)
at /home/peker/mpich2-1.2.1/src/mpid/ch3/channels/nemesis/nemesis/include/mpid_nem_inline.h:962
#2 MPIDI_CH3I_Progress (progress_state=<value optimized out>,
is_blocking=<value optimized out>) at ch3_progress.c:144
#3 0x00007fe17e1451d7 in MPIC_Wait (request_ptr=0x7fe17e40ff98)
at helper_fns.c:512
#4 0x00007fe17e14640a in MPIC_Sendrecv (sendbuf=<value optimized out>,
sendcount=0, sendtype=1275068685, dest=1, sendtag=1, recvbuf=0x0,
recvcount=0, recvtype=1275068685, source=3, recvtag=1, comm=-2080374781,
status=0x1) at helper_fns.c:163
#5 0x00007fe17e0ed203 in MPIR_Barrier (comm_ptr=<value optimized out>)
at barrier.c:75
#6 0x00007fe17e0ed7af in PMPI_Barrier (comm=-2080374782) at barrier.c:421
#7 0x00007fe17e0ed16b in pmpi_barrier_ (v1=<value optimized out>,
ierr=0x7fe17e40f470) at barrierf.c:190
#8 0x000000000055b260 in timer_barrier_ ()
#9 0x000000000050d5ba in fft3d0rc_ ()
#10 0x000000000050f6f4 in fft_backrc_ ()
#11 0x00000000005b74dd in __amoeba_recip_MOD_am_recip_perm_field ()
#12 0x00000000005ba591 in am_nonbond_perm_fields_ ()
---Type <return> to continue, or q <return> to quit---
#13 0x00000000005df3a3 in __amoeba_induced_MOD_am_induced_eval ()
#14 0x00000000005b9156 in __amoeba_interface_MOD_am_nonbond_eval ()
#15 0x00000000006f133e in force_ ()
#16 0x00000000004d305a in runmd_ ()
#17 0x0000000000498af1 in sander_ ()
#18 0x00000000004932e8 in MAIN__ ()
(gdb) break PMPI_Barrier
Breakpoint 1 at 0x7fe17e0ed450: file barrier.c, line 365.
(gdb) c
Continuing.
[wait quite a while]
^C
Program received signal SIGINT, Interrupt.
0x00007fe17d206e37 in sched_yield () from /lib/libc.so.6
(gdb)
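A single backtrace only shows where one rank is waiting. To compare all ranks, the same backtrace could be collected non-interactively from every sander.MPI process. This is only a sketch, assuming pgrep and gdb are on the PATH and that the user may attach to the processes; bt_all is a name invented here.

```shell
# bt_all [NAME]
# Hypothetical helper: print a gdb backtrace for every running process whose
# name is exactly NAME (default: sander.MPI).
bt_all() {
    name=${1:-sander.MPI}
    for pid in $(pgrep -x "$name"); do
        echo "=== $name pid $pid ==="
        # -batch makes gdb exit after running the -ex commands;
        # -p attaches to an already running process by PID.
        gdb -batch -ex bt -p "$pid" 2>/dev/null || echo "(could not attach to $pid)"
    done
}
```

Running bt_all while a test is stalled should print one backtrace per rank. If some ranks turn out to be waiting somewhere other than timer_barrier_, that might help narrow down where they diverge.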
As the gdb session above shows, this process never seems to pass the barrier.
While full_pimd_amoeba was used here as an example, the stall is reproducible
with a majority of the tests in the amber10/test/PIMD directory. The following
is a full accounting of the PIMD tests; those marked STALL suffered from the
issue described above. Also frequently encountered were errors complaining of
a missing sander.LES executable, which I have not tracked down (any ideas on
this one? It sounds like a build issue).
full_pimd_nhc_water/Run.full_pimd_nhc   pass
part_nmpimd_helium/Run.nmpimd           pass
                                        FAIL with missing sander.LES
full_cmd_water/restart/Run.full_cmd     STALL
full_cmd_water/start/Run.full_cmd       STALL
full_cmd_water/equilib/Run.full_cmd     STALL
part_rpmd_water/Run.rpmd                STALL
full_pimd_ln_water/Run.full_pimd_ln     pass
part_cmd_water/restart/Run.cmdyn        FAIL with missing sander.LES
part_cmd_water/start/Run.cmdyn          FAIL with missing sander.LES
part_cmd_water/equilib/Run.cmdyn        STALL
full_pimd_ntp_water/Run.full_pimd_ntp   pass followed by
                                        FAIL with missing sander.LES
part_pimd_helium/Run.pimd               FAIL with missing sander.LES
part_pimd_water/Run.pimd                can't run in parallel
part_nmpimd_water/Run.nmpimd            can't run in parallel
part_nmpimd_ntp/Run.nmpimd              FAIL with missing sander.LES
full_rpmd_water/Run.full_rpmd           pass
part_pimd_spcfw/Run.pimd                STALL
full_pimd_amoeba/Run.full_pimd_amoeba   STALL
full_nmpimd_water/Run.full_pimd         STALL
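For the sander.LES failures, a first check might be to see which executables the build actually produced. A small sketch follows, assuming the executables live in a directory such as /amber/amber10/exe (the test scripts refer to ../../../exe); the function name report_missing is made up here.

```shell
# report_missing DIR NAME...
# Hypothetical helper: print each NAME that is not an executable file in DIR.
report_missing() {
    dir=$1
    shift
    for exe in "$@"; do
        [ -x "$dir/$exe" ] || echo "missing: $dir/$exe"
    done
}
```

For example, report_missing /amber/amber10/exe sander.MPI sander.LES prints nothing if both files exist and are executable.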
Looking back on previous mailing list traffic, it seems that some issues[1][2]
were identified with the PIMD tests last year. I am uncertain whether these are
relevant to the issue at hand, but felt it was appropriate to mention them.
Any help you could provide in tracking down this issue would be most
appreciated.
Thanks,
- Ben Gamari
[1] http://dev-archive.ambermd.org/200809/0011.html
[2] http://dev-archive.ambermd.org/200709/0009.html
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Dec 08 2009 - 19:30:02 PST