Dear AMBER users,
I have compiled AMBER 7 and MPICH 1.2.4 on a SuSE Linux
cluster (one single-CPU master and two dual-CPU SMP nodes so
far) connected by a 100 Mbit switch.
The /home, /amber and /mpich directories are present on the
master only and NFS-mounted on the nodes. I can rsh to and
from all of my nodes, and "rsh <host> true" works in both
directions.
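(For example, roughly the checks I ran, using the hostnames
from my machines.LINUX below:

    rsh node1 true
    rsh node2 true
    rsh master true

run from the master and from each node; none of them asks for
a password or prints an error.)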
After minor problems (see my previous posts :-) I got sander
running. However, my system shows strange behaviour when I
increase the number of CPUs the calculation runs on:
My test case is the DNA_invacuo tutorial from the AMBER
homepage, a short sander MD run.
It takes about 120 sec. when run on the master alone or on
one of the nodes alone (using mpirun -np 1 -nolocal).
It takes about 70 sec. when run on both CPUs of one of the
dual nodes (using mpirun -np 2 -nolocal).
And it takes about 110 sec. when run on two CPUs in two
different machines.
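In case the exact invocation matters, the commands look
roughly like this (the sander input/output names are just
placeholders for the tutorial files, and the machinefile path
is simply where my default machines.LINUX lives):

    mpirun -np 2 -nolocal -machinefile /mpich/share/machines.LINUX \
           /amber/exe/sander -O -i md.in -o md.out \
           -p prmtop -c inpcrd -r restrt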
But when I try to use 4 CPUs, the calculation takes about
100 sec. when started with the -nolocal option, and it never
finishes at all when I run it on 4 CPUs without -nolocal.
When I stop the calculation, this error message appears:

    rm_l_?_???: (???.?????) net_send: could not write to fd=5 errno: 104

(the ? represent varying numbers).
My machines.LINUX file contains:
master
node1:2
node2:2
The PI files created by mpirun look as expected.
The MPICH cpi test program runs fine with 1-5 processors.
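(I.e., something along the lines of

    mpirun -np 5 /mpich/examples/basic/cpi

with the path adjusted to wherever the MPICH examples ended up
on my system, and it finishes with the expected result.)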
I have tried increasing P4_GLOBMEMSIZE to 10000000, as
mentioned in older posts, but nothing changed.
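In case I did that wrong: I set it like this (bash syntax,
value taken from those older posts) in the shell startup files
on the master and both nodes, hoping the rsh-started processes
on the nodes pick it up too:

    export P4_GLOBMEMSIZE=10000000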
I suspect something is wrong with my network (because a run on
two CPUs in separate machines takes so much longer than on the
two CPUs of one dual machine), but I have no clue what the
error message means.
Have any of you experienced similar problems, or do you have
hints on what I should look for?
Sorry for the long posting; I am never sure which information
might be important and which not.
Kind Regards,
Thomas