[AMBER] softcore TI with more than 2 procs?

From: Niel Henriksen <niel.henriksen.utah.edu>
Date: Mon, 7 Feb 2011 11:40:26 -0700

Hello,
I am performing softcore TI on a ligand-receptor complex in a manner similar to that of tutorial A9. It seems that when ifsc=1, I cannot run the calculation with more than two processors. Is this to be expected? (I don't see any mention of such a limitation in the manual or on the mailing list.)
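
For reference, the TI-related input is essentially what the tutorial uses. A stripped-down sketch of one group's mdin is below (the heating settings, lambda, and mask are placeholders rather than my exact values; the second group's file is the same except that its scmask names the softcore atoms of that end state):

  Heating, TI group V0 (placeholder values)
   &cntrl
     imin = 0, irest = 0, ntx = 1,
     nstlim = 50000, dt = 0.001,
     ntt = 3, gamma_ln = 5.0, tempi = 100.0, temp0 = 300.0,
     ntb = 1, cut = 9.0, ntc = 1, ntf = 1,
     ntpr = 500, ntwx = 500,
     icfe = 1, clambda = 0.5,
     ifsc = 1, scmask = ':LIG',
   /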

To start the job I do something like:
mpirun_rsh -ssh -np 24 -hostfile $PBS_NODEFILE $AMBERHOME/bin/sander.MPI -ng 2 -groupfile heat.grp.in
or
aprun -n 24 $AMBERHOME/bin/sander.MPI -ng 2 -groupfile heat.grp.in
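
The group file just gives each of the two end states its own input, topology, coordinate, and output files, something like this (file names are placeholders):

  # two TI groups, placeholder file names
  -O -i heat_v0.in -o heat_v0.out -p complex_v0.prmtop -c complex_v0.rst -r heat_v0.rst -x heat_v0.mdcrd
  -O -i heat_v1.in -o heat_v1.out -p complex_v1.prmtop -c complex_v1.rst -r heat_v1.rst -x heat_v1.mdcrd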

If I use "-np 2" then everything runs fine. If I try more processors ("-np 4" or more) then it crashes.

I have compiled two builds of fully patched Amber11:
- On an Intel dual six-core cluster with Intel compilers 11.1.038, MKL, and MVAPICH2 1.5.1
- On a Cray XT5 (Kraken) with PGI compilers version 10.6.0, GOTO, and MPT 5.0.0

The tests pass normally. sander.MPI runs fine in multisander mode for regular MD simulations, and TI calculations with ifsc=0 also run normally, even with np > 2. However, with np > 2 and ifsc=1, the jobs appear to initialize correctly but then crash on the first calculation step.

The PGI build gives this error:
Fatal error in MPI_Sendrecv: Invalid communicator, error stack:
MPI_Sendrecv(218): MPI_Sendrecv(sbuf=0x14039a0, scount=3, MPI_DOUBLE_PRECISION, dest=-32765, stag=5, rbuf=0x14039c0, rcount=3, MPI_DOUBLE_PRECISION, src=-32765, rtag=5, MPI_COMM_NULL, status=0x1812b70) failed
MPI_Sendrecv(89).: Null communicator

The Intel build (compiled with -g -traceback) gives these errors:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libmpich.so.1.2 00002B8142492FD6 Unknown Unknown Unknown
libmpich.so.1.2 00002B81424D47B4 Unknown Unknown Unknown
libmpich.so.1.2 00002B81424D4680 Unknown Unknown Unknown
sander.MPI 0000000000666438 softcore_mp_sc_mi 866 _softcore.f
sander.MPI 00000000004F0BD5 runmd_ 1132 _runmd.f
sander.MPI 00000000004B20B6 Unknown Unknown Unknown
sander.MPI 00000000004AA4DF Unknown Unknown Unknown
sander.MPI 000000000040F5BC Unknown Unknown Unknown
libc.so.6 00002B814327D994 Unknown Unknown Unknown
sander.MPI 000000000040F4C9 Unknown Unknown Unknown
....repeat.....
....repeat.....
forrtl: error (69): process interrupted (SIGINT)
Image PC Routine Line Source
libsvml.so 00002B0B47D78D20 Unknown Unknown Unknown
sander.MPI 000000000052A73B ew_bspline_mp_loa 568 _ew_bspline.f
sander.MPI 000000000051D452 ew_startup_ 1543 _ew_setup.f
sander.MPI 0000000000557870 nblist_mp_nonbond 576 _nonbond_list.f
sander.MPI 000000000073729F force_ 959 _force.f
sander.MPI 00000000004F469F runmd_ 1262 _runmd.f
sander.MPI 00000000004B20B6 Unknown Unknown Unknown
sander.MPI 00000000004AA4DF Unknown Unknown Unknown
sander.MPI 000000000040F5BC Unknown Unknown Unknown
libc.so.6 00002B0B493B9994 Unknown Unknown Unknown
sander.MPI 000000000040F4C9 Unknown Unknown Unknown

Line 866 of _softcore.f:
  call mpi_sendrecv( vcm, 3, MPI_DOUBLE_PRECISION, partner, 5, &
                         vcm_partner, 3, MPI_DOUBLE_PRECISION, partner, 5, &
                         commmaster, ist, ierr )

Line 568 of _ew_bspline.f:
  call dftmod(bsp_mod,bsp_arr,nfft2)


Any ideas about this? Thanks for the help,
--Niel


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Feb 07 2011 - 11:00:05 PST