RE: [AMBER] MPI process terminated unexpectedly after cluster upgrade

From: Ross Walker <ross.rosswalker.co.uk>
Date: Mon, 2 Nov 2009 15:36:48 -0800

Are you certain it is linking to the correct version of infiniband?

I assume this is sander, but similar instructions should be followed for
pmemd. Make sure you do the following:

1) Run: which mpif90

   Check that it is the path you expect and that it is in the same location
as mpirun. Also check that the compute nodes use the same mpirun.

2) cd $AMBERHOME/src/
3) make clean
4) Update your MPI_HOME to point to the NEW mpi location
5) ./configure -mvapich ifort
6) make parallel
7) Run the test suite in parallel and see if it works. It is probably easiest
to request an interactive session on your cluster, set DO_PARALLEL to the
correct run command, e.g. "mpirun -np 8 -machinefile $PBS_NODEFILE", and
then run: cd $AMBERHOME/test/; make test.parallel
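
As a rough sketch, the steps above could be collected into a few shell functions. The function names (same_dir, rebuild_sander, run_parallel_tests) are made up for illustration and are not part of AMBER; the commands inside them are the ones listed in the steps. The sketch assumes AMBERHOME, MPI_HOME and PBS_NODEFILE are already set in your environment.

```shell
#!/bin/sh
# Sketch only; helper names are hypothetical, not part of AMBER.

# Step 1: report whether two executables live in the same directory.
# Prints "same", "different", or "missing" if either path is empty.
same_dir() {
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo missing
    elif [ "$(dirname "$1")" = "$(dirname "$2")" ]; then
        echo same
    else
        echo different
    fi
}

# Steps 2-6: clean rebuild against the MPI that MPI_HOME points to.
# Assumes MPI_HOME has already been updated to the NEW MPI location.
rebuild_sander() {
    [ -n "$MPI_HOME" ] || { echo "MPI_HOME is not set" >&2; return 1; }
    cd "$AMBERHOME/src" || return 1
    make clean || return 1
    ./configure -mvapich ifort || return 1
    make parallel
}

# Step 7: run the parallel test suite (inside an interactive session).
run_parallel_tests() {
    DO_PARALLEL="mpirun -np 8 -machinefile $PBS_NODEFILE"
    export DO_PARALLEL
    cd "$AMBERHOME/test" || return 1
    make test.parallel
}
```

Typical use would be something like: same_dir "$(which mpif90)" "$(which mpirun)", and only if that prints "same", rebuild_sander followed by run_parallel_tests. Note that same_dir compares the paths as printed and does not resolve symlinks, so it is only a first check.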

If this crashes then I would check to make sure the new MVAPICH is actually
working properly. There should be a test suite with it that checks it is
working. Is it definitely using the correct version, e.g. the 64 bit version
on x86_64?
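
One way to see which MPI library the executable actually picks up at run time is to look at the output of ldd. The sketch below is only an illustration: mpi_libs is a made-up helper name, and the location $AMBERHOME/exe/sander.MPI is an assumption about where your build puts the executable.

```shell
#!/bin/sh
# Sketch only; mpi_libs is a hypothetical helper, not part of AMBER or MVAPICH.

# Reads `ldd` output on stdin and prints each MPI library together with
# the path the dynamic loader resolved it to (or "not found").
mpi_libs() {
    awk '/libmpi/ {
        if ($3 == "not") print $1 ": not found"
        else             print $1 ": " $3
    }'
}

# Example use (path to the executable is an assumption):
#   ldd "$AMBERHOME/exe/sander.MPI" | mpi_libs
#   file -L "$AMBERHOME/exe/sander.MPI"    # should report a 64-bit ELF on x86_64
```

If the path printed for libmpich is not under the new MVAPICH installation, the executable is still picking up the old library.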

Note: if you just recompiled without running make clean, without building a
new config_amber.h file, and without updating your MPI_HOME, then it has
likely been built with a mix of the old and new versions of MPI, which is
probably what is causing your problems.

Also make sure you are up to date on all the bugfixes.

All the best
Ross

> -----Original Message-----
> From: amber-bounces.ambermd.org [mailto:amber-bounces.ambermd.org] On
> Behalf Of Dmitri Nilov
> Sent: Monday, November 02, 2009 5:11 AM
> To: AMBER Mailing List
> Subject: Re: [AMBER] MPI process terminated unexpectedly after cluster
> upgrade
>
> Yes, I've recompiled Amber, but I couldn't change mvapich because I'm just
> a client on a serious cluster)
>
> On Mon, Nov 2, 2009 at 3:15 PM, Jason Swails <jason.swails.gmail.com>
> wrote:
>
> > It could be that the new version of mvapich broke the previous
> > installation, since the libraries could easily have changed (and if it's
> > really, in fact, a new version, I'd bet on it since there's not much else
> > that could 'change'). Did you try recompiling?
> >
> > Do the test cases still pass? If not, I'd say your only options are to
> > recompile amber/pmemd in parallel or revert back to the old version of
> > mvapich if it's still on the cluster.
> >
> > Good luck!
> > Jason
> >
> > On Mon, Nov 2, 2009 at 4:17 AM, Dmitri Nilov <nilovdm.gmail.com>
> wrote:
> >
> > > Hello!
> > > Sander.MPI tasks are crashing just after launch since the mvapich
> > > software was upgraded on the cluster.
> > > Sander.MPI.out contains:
> > >
> > > MPI process terminated unexpectedly
> > > Exit code -5 signaled from node-23-06
> > > Killing remote processes...forrtl: error (69): process interrupted (SIGINT)
> > > Image            PC                Routine  Line     Source
> > > libpthread.so.0  00007F2132C1EB00  Unknown  Unknown  Unknown
> > > libpthread.so.0  00007F2132C1DB7E  Unknown  Unknown  Unknown
> > > libmpich.so.1.0  00007F21334CB1AC  Unknown  Unknown  Unknown
> > > libmpich.so.1.0  00007F21334E1ADE  Unknown  Unknown  Unknown
> > > libmpich.so.1.0  00007F21334C050A  Unknown  Unknown  Unknown
> > > libmpich.so.1.0  00007F21334A2DED  Unknown  Unknown  Unknown
> > > libmpich.so.1.0  00007F21334A1DC6  Unknown  Unknown  Unknown
> > > sander.MPI       000000000093A0EF  Unknown  Unknown  Unknown
> > > sander.MPI       00000000004BC222  Unknown  Unknown  Unknown
> > > sander.MPI       000000000041E05C  Unknown  Unknown  Unknown
> > > libc.so.6        00007F213216ACF4  Unknown  Unknown  Unknown
> > > sander.MPI       000000000041DF69  Unknown  Unknown  Unknown
> > > forrtl: error (69): process interrupted (SIGINT)
> > > and so on..
> > >
> > > I've found a similar problem at
> > > http://archive.ambermd.org/200907/0092.html that seems to be still
> > > unsolved.
> > > I don't think it's an infiniband problem. So what do I have to do?
> > >
> > > Thanks a lot!
> > > Dmitri Nilov,
> > > Lomonosov Moscow State University
> > >
> > > _______________________________________________
> > > AMBER mailing list
> > > AMBER.ambermd.org
> > > http://lists.ambermd.org/mailman/listinfo/amber
> > >
> > >
> >
> >
> > --
> > ---------------------------------------
> > Jason M. Swails
> > Quantum Theory Project,
> > University of Florida
> > Ph.D. Graduate Student
> > 352-392-4032


Received on Mon Nov 02 2009 - 16:00:06 PST