Hi all,
I have installed AMBER7 with MPICH on a Linux RedHat-7.2 cluster. To validate my installation, I ran a script that had already been run with AMBER7 installed on a single-CPU SGI server. The script consists of several successive commands, including calls to sander, and has the following structure:
-----------
#!/bin/csh -f
setenv AMBERHOME /data/test/amber7
setenv MPICH_HOME /usr/share/mpi
setenv DO_PARALLEL "$MPICH_HOME/bin/mpirun -np 4 -machinefile $MPICH_HOME/share/machines.LINUX"
$AMBERHOME/exe/sander -O \
-i mini1.in \
-o test1.out \
-p test.top \
-c test.crd \
-inf test1.info \
-r test1.rst
$AMBERHOME/exe/sander -O \
-i mini2.in \
(...)
-----------
Whenever I try to run such a script, sander crashes with the following error messages:
-----------
Unit 5 Error on OPEN: mini1.in
[0] MPI Abort by user Aborting program !
[0] Aborting program!
p0_3492: p4_error: : 1
Unit 5 Error on OPEN: mini2.in
[0] MPI Abort by user Aborting program !
[0] Aborting program!
p0_3493: p4_error: : 1
(...)
-----------
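One thing I notice while re-reading the script is that it sets DO_PARALLEL but never uses it: sander is called directly. For what it is worth, here is how I would expect the calls to look if $DO_PARALLEL is meant to prefix each sander invocation, the way the AMBER test Makefiles use it (this is an assumption on my part):
-----------
#!/bin/csh -f
setenv AMBERHOME /data/test/amber7
setenv MPICH_HOME /usr/share/mpi
setenv DO_PARALLEL "$MPICH_HOME/bin/mpirun -np 4 -machinefile $MPICH_HOME/share/machines.LINUX"
# Prefix each call with $DO_PARALLEL so mpirun starts sander on all nodes
$DO_PARALLEL $AMBERHOME/exe/sander -O \
    -i mini1.in \
    -o test1.out \
    -p test.top \
    -c test.crd \
    -inf test1.info \
    -r test1.rst
-----------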
As I already said above, a researcher in our molecular modeling team ran the same test files on an SGI machine where I had previously installed AMBER7 locally, without MPICH, and it worked fine. Is there a problem with our input files? Do the input files for AMBER7 differ when you run it on one processor versus several?
What I am sure about today is that "make test.sander" passed without any problem on the cluster. I don't know whether MPICH is correctly configured or not, but I think it is, because of some tests I ran successfully (see below).
Can anyone tell me what a "Unit 5 error" is, and how I can deal with it so that sander runs normally on all the processors I define in the machinefile?
We have also experienced sander crashes with a "Unit 6 error" that seemed to be related to the ".out" files. Does anyone have any information about this as well?
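In case it helps to narrow things down, I am thinking of adding a quick pre-flight check like the one below at the top of the script. My assumption is that a unit 5 OPEN error means sander cannot read the file given with -i from its working directory:
-----------
# Hypothetical check: verify every input file is readable from the
# directory the script runs in before calling sander.
foreach f (mini1.in mini2.in test.top test.crd)
    if (! -r $f) echo "cannot read $f from `pwd`"
end
-----------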
Here is some information about the machines and the tests I ran to validate my MPICH installation. Maybe it will help you form an idea of what is happening:
-----------
IBM x330series - Linux RedHat-7.2
Test of parallel computing using "mpich-1.2.0" installed from RedHat's RPMs.
$MPICH_HOME=/usr/share/mpi
DO_PARALLEL="$MPICH_HOME/bin/mpirun -np 4 -machinefile $MPICH_HOME/share/machines.LINUX"
The MPICH machinefile is "machines.LINUX" and contains 4 lines formatted as follows:
machine2.ourdomain
machine2.ourdomain
machine1.ourdomain
machine1.ourdomain
"machine2" and "machine1" are biprocessors nodes in my cluster
The /data/test directory is local to "machine1" and is NFS-mounted on "machine2" at the same path, /data/test.
User "me" owns $MPICH_HOME directory (and all of its contents).
User "me" also owns /data/test directory (and all of its contents, including the "cpi" executable file).
The command line used and the associated output look like this:
<me_at_machine1:/data/test>/usr/share/mpi/bin/mpirun -np 4 -machinefile /usr/share/mpi/share/machines.LINUX ./cpi
Process 0 on machine1.ourdomain
Process 3 on machine1.ourdomain
Process 1 on machine2.ourdomain
Process 2 on machine2.ourdomain
pi is approximately 3.1416009869231249, Error is 0.0000083333333318
wall clock time = 0.001346
-----------
Thanks in advance to all those who will help me.
Vincent.
---------------------------------------------------------------------
Vincent Bosquier
IT Engineer
Synt:em
Computational Drug Discovery
Parc Scientifique G.Besse
Allee Charles Babbage
30035 Nimes Cedex 1
France
E-mail: vbosquier_at_syntem.com
Direct line: +33 (0)466 042 294
Switchboard: +33 (0)466 048 666
Fax: +33 (0)466 048 667
---------------------------------------------------------------------
Discover New Drugs, Discover Synt:em
http://www.syntem.com
---------------------------------------------------------------------