Re: [AMBER] MPI process terminated unexpectedly after cluster upgrade

From: Dmitri Nilov <nilovdm.gmail.com>
Date: Tue, 3 Nov 2009 16:20:17 +0300

Yes, I've followed all these instructions. The program is Amber10/sander.MPI.
The serial tests pass, but most of the parallel test cases finish with
"possible FAILURE: check *.dif", and the corresponding sander.MPI.out files
contain similar errors.
Which test cases are most appropriate for analysing the outputs?
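(For reference, one quick way to gather and skim the failing diffs -- a
sketch, assuming the stock Amber test tree under $AMBERHOME/test:

cd $AMBERHOME/test
# list every recorded difference file, largest deviations tend to stand out by size
find . -name "*.dif" | xargs ls -l
# then open any one of them, e.g. the first found:
less $(find . -name "*.dif" | head -n 1)
)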

> ./configure -mvapich ifort
I suppose that means ./configure_amber -mpich ifort?

I don't suppose there could be serious mistakes in the InfiniBand or MVAPICH
installation.

Thanks!
Dmitri Nilov,
Lomonosov Moscow State University

On Tue, Nov 3, 2009 at 2:36 AM, Ross Walker <ross.rosswalker.co.uk> wrote:

> Are you certain it is linking to the correct version of infiniband?
>
> Make sure you do the following:
>
> I assume this is sander, but similar instructions should be followed for
> pmemd.
>
> 1) run > which mpif90
>
> Check that it is the path you expect. Check that it is the same path as
> mpirun. Also check that the compute nodes use the same mpirun.
>
> 2) cd $AMBERHOME/src/
> 3) make clean
> 4) Update your MPI_HOME to point to the NEW mpi location
> 5) ./configure -mvapich ifort
> 6) make parallel
> 7) Run the test suite in parallel and see if this works - probably easiest
> to request an interactive session on your cluster and then set DO_PARALLEL
> to the correct run command, e.g. "mpirun -np 8 -machinefile $PBS_NODEFILE ",
> and then cd $AMBERHOME/test/; make test.parallel. (A consolidated sketch of
> steps 1-7 follows below.)
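>
> Taken together, the rebuild might look something like this (a sketch
> assuming bash, Amber 10's configure_amber script, and a hypothetical new
> MVAPICH prefix -- adjust the paths and MPI flag for your site):
>
> export MPI_HOME=/opt/mvapich-new      # hypothetical new MPI location
> export PATH=$MPI_HOME/bin:$PATH
> which mpif90; which mpirun            # both should resolve under $MPI_HOME/bin
>
> cd $AMBERHOME/src
> make clean
> ./configure_amber -mpich ifort        # or ./configure -mvapich, per step 5;
>                                       # use whichever your Amber/MPI combo takes
> make parallel
>
> # from an interactive session on the compute nodes:
> export DO_PARALLEL="mpirun -np 8 -machinefile $PBS_NODEFILE "
> cd $AMBERHOME/test && make test.parallel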
>
> If this crashes then I would check to make sure the new MVAPICH is actually
> working properly. There should be a test suite with it that checks it is
> working. Is it definitely using the correct version, e.g. the 64-bit version
> on x86_64?
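>
> Short of MVAPICH's own test suite, a minimal hello-world check exercises
> the MPI stack by itself, independent of Amber (a sketch; hello_mpi.f90 is
> a made-up file name, and it assumes mpif90 and mpirun come from the new
> MVAPICH):
>
> cat > hello_mpi.f90 <<'EOF'
> program hello
>   implicit none
>   include 'mpif.h'
>   integer :: ierr, rank, nprocs
>   call MPI_Init(ierr)
>   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
>   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
>   write(*,*) 'rank ', rank, ' of ', nprocs
>   call MPI_Finalize(ierr)
> end program hello
> EOF
> mpif90 hello_mpi.f90 -o hello_mpi
> mpirun -np 4 -machinefile $PBS_NODEFILE ./hello_mpi
>
> If even this fails across nodes, the problem is in the MVAPICH/InfiniBand
> setup rather than in Amber.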
>
> Note, if you just recompiled without making clean and without building a new
> config_amber.h file and updating your MPI_HOME, then it has likely been built
> with a mix of the old and new versions of MPI, which is probably what is
> causing your problems.
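>
> One quick way to catch such a mixed build (a sketch; the exact library
> names vary with the MVAPICH version):
>
> ldd $AMBERHOME/exe/sander.MPI | grep -i mpi
>
> Every MPI library listed should resolve into the NEW installation's lib
> directory; anything still pointing at the old path means a stale object
> file or config_amber.h survived the rebuild.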
>
> Also make sure you are up to date on all the bugfixes.
>
> All the best
> Ross
>
> > -----Original Message-----
> > From: amber-bounces.ambermd.org [mailto:amber-bounces.ambermd.org] On
> > Behalf Of Dmitri Nilov
> > Sent: Monday, November 02, 2009 5:11 AM
> > To: AMBER Mailing List
> > Subject: Re: [AMBER] MPI process terminated unexpectedly after cluster
> > upgrade
> >
> > Yes, I've recompiled Amber, but I couldn't change MVAPICH because I'm
> > just a user on a serious cluster. :)
> >
> > On Mon, Nov 2, 2009 at 3:15 PM, Jason Swails <jason.swails.gmail.com> wrote:
> >
> > > It could be that the new version of mvapich broke the previous
> > > installation, since the libraries could easily have changed (and if it's
> > > really, in fact, a new version, I'd bet on it, since there's not much
> > > else that could 'change'). Did you try recompiling?
> > >
> > > Do the test cases still pass? If not, I'd say your only options are to
> > > recompile amber/pmemd in parallel or revert back to the old version of
> > > mvapich, if it's still on the cluster.
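> > >
> > > If the cluster uses environment modules (an assumption -- check with
> > > your admins), something like this shows whether the old build is still
> > > around:
> > >
> > > module avail mvapich        # list the installed mvapich modules
> > > module load mvapich/1.0     # hypothetical old version string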
> > >
> > > Good luck!
> > > Jason
> > >
> > > On Mon, Nov 2, 2009 at 4:17 AM, Dmitri Nilov <nilovdm.gmail.com> wrote:
> > >
> > > > Hello!
> > > > sander.MPI jobs have been crashing just after launch, ever since the
> > > > MVAPICH software was upgraded on the cluster.
> > > > sander.MPI.out contains:
> > > >
> > > > MPI process terminated unexpectedly
> > > > Exit code -5 signaled from node-23-06
> > > > Killing remote processes...forrtl: error (69): process interrupted (SIGINT)
> > > >
> > > > Image            PC                Routine  Line     Source
> > > > libpthread.so.0  00007F2132C1EB00  Unknown  Unknown  Unknown
> > > > libpthread.so.0  00007F2132C1DB7E  Unknown  Unknown  Unknown
> > > > libmpich.so.1.0  00007F21334CB1AC  Unknown  Unknown  Unknown
> > > > libmpich.so.1.0  00007F21334E1ADE  Unknown  Unknown  Unknown
> > > > libmpich.so.1.0  00007F21334C050A  Unknown  Unknown  Unknown
> > > > libmpich.so.1.0  00007F21334A2DED  Unknown  Unknown  Unknown
> > > > libmpich.so.1.0  00007F21334A1DC6  Unknown  Unknown  Unknown
> > > > sander.MPI       000000000093A0EF  Unknown  Unknown  Unknown
> > > > sander.MPI       00000000004BC222  Unknown  Unknown  Unknown
> > > > sander.MPI       000000000041E05C  Unknown  Unknown  Unknown
> > > > libc.so.6        00007F213216ACF4  Unknown  Unknown  Unknown
> > > > sander.MPI       000000000041DF69  Unknown  Unknown  Unknown
> > > > forrtl: error (69): process interrupted (SIGINT)
> > > > ...and so on.
> > > >
> > > > I've found a similar problem at
> > > > http://archive.ambermd.org/200907/0092.html, which seems to be still
> > > > unsolved.
> > > > I don't think it's an InfiniBand problem. So what should I do?
> > > >
> > > > Thanks a lot!
> > > > Dmitri Nilov,
> > > > Lomonosov Moscow State University
> > > >
> > >
> > > --
> > > ---------------------------------------
> > > Jason M. Swails
> > > Quantum Theory Project,
> > > University of Florida
> > > Ph.D. Graduate Student
> > > 352-392-4032
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Nov 03 2009 - 05:30:02 PST