RE: sander mpich hangs

From: Mark Hsieh <hsieh_at_abmaxis.com>
Date: Fri 26 Apr 2002 12:08:48 -0700

I am able to rsh and rlogin to all the nodes without having to
type a password. There is a .rhosts file containing the allowed
machine/user combination on each node.

The problem seems intermittent. The fewer the processors, the more
likely the mpirun sander job is to start. So -np 1, 2, ... work most
of the time, but with more processors the sander process(es) hang
more frequently and never generate an output file.
I can rsh in and see hung sander processes on each node. When I kill
the initiating mpirun, those waiting sander processes also disappear,
so at least the cleanup seems okay.

I tried using the mpirun command option "-p4pg" with a processor group
file and that seems to help with the larger number of processors but I
can't be certain that it's not something else that fixed the problem
for the current run.
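
For reference, a minimal sketch of generating a processor group file for
"-p4pg". It assumes the usual ch_p4 layout of one "host nprocs program"
line per machine, where the first line names the launching host and its
count is the number of ADDITIONAL processes there (0 means only the
process that mpirun starts itself). The host names and the sander path
are hypothetical placeholders, not taken from the cluster described here.

```python
def build_procgroup(local_host, remote_hosts, program,
                    procs_per_remote=1, extra_local=0):
    """Return the text of a ch_p4 procgroup file.

    Assumed format: "<host> <nprocs> <full path to program>" per line.
    extra_local is the number of extra processes on the launching host,
    in addition to the one mpirun starts there.
    """
    lines = [f"{local_host} {extra_local} {program}"]
    for host in remote_hosts:
        lines.append(f"{host} {procs_per_remote} {program}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical hosts and install path, for illustration only.
    print(build_procgroup("node1", ["node2", "node3"],
                          "/usr/local/amber7/exe/sander"), end="")
```

The resulting text would be saved to a file and passed as
"mpirun -p4pg <file> <program>". On a dual-CPU launching node,
extra_local=1 would request two processes there.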

I tried mpich/sbin/tstmachines, which gave inconsistent results.
Sometimes it completes without error messages. More frequently, it
reports that it cannot rsh to a node or that the file systems are
inconsistent, but those errors do not seem to correlate with whether
an mpirun sander job will start on the cluster.
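
Since the failures are intermittent, one way to narrow them down is to
probe rsh to each node several times and count the failures. The sketch
below is only an illustration: probe_host and its parameters are
hypothetical names, and it assumes "rsh <host> true" is a reasonable
connectivity check on this cluster.

```python
import subprocess


def probe_host(host, attempts=5, timeout=10.0, command=("rsh",),
               runner=subprocess.run):
    """Run "<command> <host> true" several times; return the failure count.

    A timeout, a missing rsh binary, or a nonzero exit status counts as a
    failure. Classic rsh generally does not pass back the remote command's
    exit status, so a zero status here only suggests that the connection
    itself succeeded.
    """
    failures = 0
    for _ in range(attempts):
        try:
            result = runner([*command, host, "true"],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL,
                            timeout=timeout)
            ok = result.returncode == 0
        except (subprocess.TimeoutExpired, OSError):
            ok = False
        if not ok:
            failures += 1
    return failures
```

Calling probe_host("node2", attempts=20) for each machine in the
machines file would show whether particular nodes refuse connections
more often than others.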

It would be nice to know if there is some way to output debugging
or error messages for either mpich/mpirun or sander.

Thanks,
Mark

> -----Original Message-----
> From: Stéphane Teletchéa [mailto:steletch_at_biomedicale.univ-paris5.fr]
> Sent: Friday, April 26, 2002 4:54 AM
> To: Mark Hsieh; amber_at_heimdal.compchem.ucsf.edu
> Subject: Re: sander mpich hangs
>
>
> On Friday 26 April 2002 00:01, Mark Hsieh wrote:
> > Hi,
> >
> > I'm trying to set up amber 7 to run on a linux cluster
> > running RedHat 6.2. Rsh version of MPICH-1.2.3 and
> > amber7 compiled and installed correctly.
> >
> > I'm trying to get sander to run on a single dual-cpu node
> > but a sander process starts, taking up 99% of one CPU, without
> > generating any output, although an empty file is created.
> >
> > Eventually, the process will say "Connection refused."
> > Rsh is working properly on the node itself. Does sander
> > have problems running under RH6.2?
> >
> > A similar mpich/amber setup on a dual-cpu RedHat 7.1
> > system works fine.
> >
> > Mark
>
> Can you do an rlogin on this machine without being prompted for a
> password?
> If not, you must define a .rhosts file in your home directory
> containing the names of the allowed machines.
> Stef
> --
> *~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*
> Teletchéa Stéphane - CNRS UMR 8601
> Lab. de chimie et biochimie pharmacologiques et toxicologiques
> 45 rue des Saints-Peres 75270 Paris cedex 06
> tel : (33) - 1 42 86 20 86 - fax : (33) - 1 42 86 83 87
> mél : steletch_at_biomedicale.univ-paris5.fr
> *~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*
>
>
Received on Fri Apr 26 2002 - 12:08:48 PDT