sander mpirun hangs with 4 CPU, but not with 2

From: Thomas Steinbrecher <thomas.steinbrecher_at_physchem.uni-freiburg.de>
Date: Tue 30 Jul 2002 17:09:03 +0200

Dear AMBER users,

I have compiled AMBER 7 and MPICH 1.2.4 on a SuSE Linux
cluster (one single-CPU master and two dual-CPU SMP nodes so
far) connected by a 100 Mbit switch.
The /home, /amber and /mpich directories are present on the
master only and are NFS-mounted on the nodes. I can rsh and
rsh true to and from all of my nodes.

After minor problems (see my previous posts :-) I got
sander running. However, my system behaves strangely when
I increase the number of CPUs used for the calculation:

My test case is the DNA_invacuo tutorial from the AMBER
home page, a short sander MD run.

It takes about 120 sec when run on the master alone or on
one of the nodes alone (using mpirun -np 1 -nolocal).

It takes about 70 sec when run on both CPUs of one of the
dual nodes (using mpirun -np 2 -nolocal).

And it takes about 110 sec when run on two CPUs from
different computers.

But when I try to use 4 CPUs, the calculation takes about
100 sec when started with the -nolocal option, and never
finishes at all when I run it with 4 CPUs without
-nolocal.
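
To put the timings above into perspective, here is a minimal sketch that converts them into speedup and parallel efficiency relative to the 120 sec single-CPU run. The labels and the function name are illustrative only; the numbers are the approximate wall-clock times quoted above.

```python
# Approximate wall-clock times (seconds) quoted in the post.
T_SERIAL = 120.0

# label -> (number of CPUs, wall-clock time); labels are illustrative.
RUNS = {
    "2 CPUs, one dual node": (2, 70.0),
    "2 CPUs, two machines": (2, 110.0),
    "4 CPUs, -nolocal": (4, 100.0),
}


def speedup_and_efficiency(t_serial, n_cpus, t_parallel):
    """Return (speedup, efficiency) for one parallel run."""
    speedup = t_serial / t_parallel
    return speedup, speedup / n_cpus


for label, (n_cpus, t_parallel) in RUNS.items():
    s, e = speedup_and_efficiency(T_SERIAL, n_cpus, t_parallel)
    print(f"{label}: speedup {s:.2f}, efficiency {e:.0%}")
```

Efficiency is roughly 86% inside one dual node but only about 55% across two machines and 30% on 4 CPUs, which suggests (but does not prove) that communication between machines is the limiting factor.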

When I stop the calculation, the following error message
appears:

rm_l_?_???: (???.?????) net_send: could not write to fd=5
errno: 104

The ? characters represent varying numbers.
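
As a hint for decoding the message: on Linux, errno 104 is ECONNRESET ("Connection reset by peer"), meaning the process at the other end of the socket closed the connection or died. A minimal sketch for looking up an errno value on the machine where the error occurred (the helper name is illustrative; errno numbering is platform-specific):

```python
import errno
import os


def describe_errno(code):
    """Return 'SYMBOLIC_NAME: message' for an errno value on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# 104 is the value reported by net_send in the error message above.
print(describe_errno(104))
```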

My machines.LINUX file contains:
master
node1:2
node2:2

and the PI files created by mpirun look as expected.

The mpi cpi test program runs fine with 1-5 processors.
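
For reference, the machines file above lists 5 process slots in total, which matches the 1-5 processors used for the cpi test. A minimal sketch that counts the slots, assuming the "host:n" form means n processes on that host and a bare host name means one; the function name is illustrative and this is not part of MPICH:

```python
def count_slots(machines_text):
    """Map each host in a machines file to its number of process slots."""
    slots = {}
    for line in machines_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        host, _, count = line.partition(":")
        # Assumption: a host without ":n" provides a single slot.
        slots[host] = int(count) if count else 1
    return slots


MACHINES = """master
node1:2
node2:2
"""

slots = count_slots(MACHINES)
print(slots, "total:", sum(slots.values()))
```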

I have tried to increase $P4_GLOBMEMSIZE to 10000000 as
mentioned in older posts, but nothing changed.

I suspect something is wrong with my network (because the
run on two CPUs in separate machines takes so much longer
than on one dual machine), but I have no clue what the
error message means.

Has anyone experienced similar problems, or does anyone
have hints on what I should look for?

Sorry for the long posting, I'm never sure what information
might be important and what not.

Kind Regards,

Thomas
Received on Tue Jul 30 2002 - 08:09:03 PDT