I don't know whether what happened to me is the same as what's happening to
you, but since I sent in a question last week, and the answer may be
material for an FAQ, here goes:
We also saw a similar hang. Our cluster has one outward-facing node with
a "normal" IP address and the name x00.achs.virginia.edu. Behind it, on a
separate Ethernet interface and switch, are 20 additional nodes with private
(192.168.x.x) IP addresses. The nodes know each other as x00 through x20 (no
".achs.virginia.edu" appended).
It turns out that we had to add "x00.achs.virginia.edu" as an alias for "x00"
in the /etc/hosts tables on x01 through x20. Apparently, when MPI starts up
on x00, it sends its "hostname", namely "x00.achs.virginia.edu", to the other
nodes. They must be able to resolve this name in order to connect back to
the master.
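
To make the alias fix concrete, here is a minimal Python sketch that checks
whether a hosts table can resolve a given list of names. It only parses text
in the /etc/hosts layout (IP address, then name and aliases); it does not
query the real resolver. The 192.168.1.x addresses and the helper names
parse_hosts and unresolved are made up for illustration; only the host names
come from our cluster.

```python
def parse_hosts(text):
    """Map every name and alias in hosts-file text to its IP address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            # Keep the first entry for a name, as a hosts lookup would.
            table.setdefault(name.lower(), ip)
    return table


def unresolved(text, required):
    """Return the names in `required` that the hosts text cannot resolve."""
    table = parse_hosts(text)
    return [name for name in required if name.lower() not in table]


# Hypothetical /etc/hosts on a compute node before the fix (addresses invented).
BEFORE = """\
127.0.0.1      localhost
192.168.1.100  x00
192.168.1.101  x01
"""

# The same table with the master's full name added as an alias for x00.
AFTER = """\
127.0.0.1      localhost
192.168.1.100  x00 x00.achs.virginia.edu
192.168.1.101  x01
"""

NEEDED = ["x00", "x00.achs.virginia.edu"]

print(unresolved(BEFORE, NEEDED))
print(unresolved(AFTER, NEEDED))
```

With the BEFORE table the full name is reported as unresolved; with the AFTER
table nothing is missing, which is the state each of x01 through x20 needs.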
Note that switching the entries in your machinefile is really a red herring
here, since without the "-nolocal" option on the mpirun command, you get
your first (and only, if np=1) instance on the localhost.
Hope this helps.
Tom Spraggins
tas_at_virginia.edu
-----Original Message-----
From: Walter Langel [mailto:langel_at_mail.uni-greifswald.de]
Sent: Monday, July 08, 2002 8:27 AM
To: root; amber_at_heimdal.compchem.ucsf.edu
Subject: Re: installing MPICH for AMBER on Linux CLuster
Hi,
could you send us the output of the -echo option,
mpirun -echo -np 2 .....
Also, it is better not to run programs as root.
Regards
Walter Langel
root wrote:
>Greetings,
>
>I am trying to build a linux cluster to run AMBER simulations. While
>installing MPICH I ran across a problem that has been troubling me for some
>days now. Perhaps one of you knows a solution and can help me with it.
>
>My system consists of:
>
>Hardware:
>
> 1 Linux PC (Athlon 1800) with 2 NICs acting as master
>
> 1 Linux PC (SMP dual Athlon 1600) acting as node (more of those to come
> when the system runs)
>
> 1 Allied Telesyn switch connecting the computers
>
>Software:
>
> Suse Linux 8.0 (kernel 2.2.13) installed on both machines
> the node's home directory and mpich directory are nfs-mounted
> (nfs version 2) from the master
>
> I made the following modifications:
>
> I allowed passwordless rsh login between all computers (the tstmachines
> script of mpich worked without errors, I also tried rsh host true
> with all of them)
>
> I installed MPICH-1.2.4 with options device=ch_p4 and comm=shared
> (I tried without the options first, but the problem stayed the
> same)
>
> I set up a machines.LINUX file with
>
> > master
> > node1:2
>
>Problem:
>
> When I try to run the cpi test program with mpirun, it fails when I
> try to use processors from both machines, that is:
>
> mpirun -np 1 /examples/basic/cpi
>
> runs without problem
>
> mpirun -np 2 /examples/basic/cpi
>
> hangs after creating the PI-file:
>
> > running /usr/local/mpich-1.2.4/examples/basic/cpi on 2 LINUX
> > ch_p4 processors
> > Created /home/tom/PI23485
>
> The PI-file is:
> > master 0 /usr/local/mpich-1.2.4/examples/basic/cpi
> > node1 1 /usr/local/mpich-1.2.4/examples/basic/cpi
>
> When I switch the two names in the machine file, it also runs with
> -np 1, but hangs with -np 2.
>
> When I try with -np 3 it also hangs, the PI-file is:
>
> > pc2-117 0 /usr/local/mpich-1.2.4/examples/basic/cpi
> > node1 2 /usr/local/mpich-1.2.4/examples/basic/cpi
>
>I'm afraid as a newbie to Linux I cannot solve this alone. I didn't find
>hints on this problem in the MPICH or AMBER mail archives or
>documentation, partly because I don't know exactly what I'm looking
>for.
>
>Please mail if anyone has a clue what to try next.
>
>Kind regards,
>
>Thomas
>
>
>
--
Prof. Dr. Walter Langel
Institut fuer Chemie und Biochemie
Universitaet Greifswald
Soldmannstrasse 23
D-17487 Greifswald
Germany
Tel +49 3834 86 4423
Fax +49 3834 86 4413
http://www.chemie.uni-greifswald.de/~plasma
Received on Mon Jul 08 2002 - 06:15:23 PDT