Dear Jianhui,
It seems that you are not following the suggested format for the
machines.LINUX file. As you can read in the manual, you don't need to use
the FQDN (fully qualified domain name); the plain machine name is enough.
As an example, look at our machines.LINUX file:
$ cat /usr/local/mpich-1.2.3/share/machines.LINUX
n1
n2
In your case, with two-processor nodes, just add a :2 after each hostname:
n1:2
n2:2
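If you already have a list of hosts in another form, the conversion can be
scripted. This is only a minimal sketch: the function name to_machines_file
is made up for illustration, and it assumes one host name (not an IP
address) in the first column of each line. It strips quotes, any existing
":N" suffix and the domain part, then appends the processor count.

```shell
# to_machines_file [CPUS]
# Reads host names on stdin and prints "shortname:CPUS" lines,
# in the style of the machines.LINUX entries shown above.
# Hypothetical helper; assumes the first column is a host name, not an IP.
to_machines_file() {
    cpus=${1:-2}
    awk -v cpus="$cpus" '
        NF == 0 || /^#/ { next }    # skip blank lines and comments
        {
            name = $1
            gsub(/"/, "", name)     # drop double quotes around the name
            sub(/:.*/, "", name)    # drop an existing :N suffix
            sub(/\..*/, "", name)   # drop the domain part of an FQDN
            print name ":" cpus
        }'
}

# Example (writes to a scratch file rather than the real machines.LINUX):
#   to_machines_file 2 < old_machines > machines.LINUX.new
```

Check the result by eye before copying it over the real file, and make sure
the short names it produces are the ones your nodes actually answer to.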
Also, since you might not have a DNS server resolving these short names,
make sure to add them to the file /etc/hosts.
Again, look at our /etc/hosts:
$ cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
192.168.1.1 n1
192.168.1.2 n2
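A quick way to spot names that are in the machines file but not in
/etc/hosts is a small script like the one below. Again, this is just a
sketch under assumptions: missing_from_hosts is a made-up name, and it only
compares the two text files, so it says nothing about names that DNS would
resolve for you.

```shell
# missing_from_hosts MACHINES_FILE [HOSTS_FILE]
# Prints every machine name from MACHINES_FILE that does not appear as a
# name or alias in HOSTS_FILE (default: /etc/hosts).
# Hypothetical helper; a purely textual comparison, no network lookups.
missing_from_hosts() {
    machines=$1
    hosts=${2:-/etc/hosts}
    awk '
        FILENAME == ARGV[1] {       # first file: the hosts file
            sub(/#.*/, "")          # strip comments
            for (i = 2; i <= NF; i++)
                known[$i] = 1       # every name or alias after the address
            next
        }
        NF == 0 || /^#/ { next }    # machines file: skip blanks and comments
        {
            name = $1
            sub(/:.*/, "", name)    # drop the :N processor count
            if (!(name in known))
                print name
        }' "$hosts" "$machines"
}

# Example:
#   missing_from_hosts /usr/local/mpich-1.2.3/share/machines.LINUX
```

No output means every name in the machines file has an entry in the hosts
file.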
To make sure your system resolves the names, you can use the network tool
ping:
$ ping n1
If you get an error like "unknown host", something is wrong. Otherwise,
it's okay.
I'd suggest you look at the files /etc/host.conf and /etc/resolv.conf, as
they control the name resolution order and other important parameters.
Beyond all of this, decide whether you want to use rsh or ssh. From my
point of view, I would first make sure that rsh is working, and then move
to ssh.
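To test the remote shell on every node before involving mpirun, something
like the following may help. It is a sketch only: check_remote_shell and the
REMOTE_SHELL variable are names invented here, not part of MPICH. It runs
"hostname" on each node and reports the nodes where that fails or would need
interaction.

```shell
# check_remote_shell HOST...
# Runs "hostname" on each HOST through $REMOTE_SHELL (default: rsh) and
# reports whether a non-interactive login worked.
# Hypothetical helper; REMOTE_SHELL is a variable made up for this sketch.
check_remote_shell() {
    failed=0
    for host in "$@"; do
        # stdin from /dev/null so the remote shell cannot wait for input;
        # REMOTE_SHELL is left unquoted on purpose so it may carry options.
        if out=$(${REMOTE_SHELL:-rsh} "$host" hostname </dev/null 2>/dev/null) \
           && [ -n "$out" ]; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return $failed
}

# Examples:
#   check_remote_shell n1 n2
#   REMOTE_SHELL="ssh -o BatchMode=yes" check_remote_shell n1 n2
```

The second example assumes OpenSSH, where BatchMode=yes makes ssh fail
instead of stopping to ask a question such as the host key confirmation
quoted below.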
Hope this helps.
-------------------------------------------------------------
Joffre Heredia Rodrigo Tel: (34)-93-5813812
Laboratory of Computational Medicine Fax: (34)-93-5812344
Biostatistic Dept.
UAB School of Medicine. Bellaterra Joffre.Heredia_at_uab.es
08193-Barcelona (SPAIN)
-------------------------------------------------------------
On Wed, 15 May 2002, Jianhui Wu wrote:
> Dear Amber Linux Cluster users,
>
> I try to compile and run Sander parallel version on a Linux Cluster. Until
> now, I can only manage to run Sander (parallel version, amber7) on single
> processor. Basically, I don't know how to setup the parallel computing
> environment to run sander job with multiple processors. Can someone give
> me a hand or point me to some useful instruction for my system?
>
> Here are what I have done.
>
> (1) Sander of amber7 was compiled using pgf77, machine file:
> Machine.pgf77_mpich (download from amber webpage), mpich-1.2.4 installed.
>
> (2) Machines: 15 dual processors SMP Linux cluster (amd3-mosix)
>
> (3) I define the DO_PARALLEL variable as follows.
>
> setenv DO_PARALLEL "$MPICH_HOME/bin/mpirun -np 2 -machinefile
> $MPICH_HOME/util/machines/machines.LINUX"
>
> (4) The files are shared by all nodes and I can rlogin to each node
> without problem.
>
> (5) Problems:
> If mpirun -np 1, then, the test jobs are fine.
> If mpirun -np 2 or above, the sander job aborted with error message.
>
> For example, if I submit the job with mpirun -np 2 at apple.x.y.ca,
> after I define the machine file machines.LINUX as follow,
>
> "apple.x.y.ca" 2
> "cherry.x.y.ca" 2
> ......
>
>
> (a) I got the error message
> ****************************************************************************
> p0_20194: p4_error: Could not gethostbyname for host "apple.x.y.ca"; may
> be invalid name : 61
> **************************************************************************
>
>
> (b) There is a file PI20114 exist after I submit the job. This file
> contain
> --------------------------------------------------
> apple.x.y.ca 0 /home/....../amber7/exe/sander
> "apple.x.y.ca" 1 /home/....../amber7/exe/sander
> -------------------------------------------------
>
> (c) If I change the machine file into
>
> apple.x.y.ca:2
> cherry.x.y.ca:2
> .....
>
>
> I got message:
> **************************************
> Host key not found from the list of known hosts.
> Are you sure you want to continue connecting (yes/no)?
> ****************************************************
>
> I also try to run lamboot at node1-3, define -np 2 and
> run sander again. Similar problem.
>
>
> It seems I don't even get the two processors in the
> same box to work for a single Sander job. As I am new to
> parallel computing, could someone give me some tips as to
> what should I do (install what library, which software....)
> in order to run Sander job with multiple processors (I have
> 15 dual-processor nodes).
>
> Thanks a lot for your help,
>
> Jian Hui Wu
>
> Lady Davis Institute
>
Received on Thu May 16 2002 - 11:01:16 PDT