Dear Reflector,
We are currently testing Amber 9 on a new machine and are having
problems with the MPI communications. Before we start looking at
hardware and driver issues, I was wondering whether there are any
known compatibility issues between the machine and the way Amber is
compiled. Any comments or help would be much appreciated.
The basic specs of the machine are as follows:
128 compute nodes, each with two quad-core Intel Harpertown 3.0 GHz
processors, for a total of 1024 cores;
Voltaire 20 Gbit/s InfiniBand fabric, used both to share files through
GPFS and to run MPI jobs.
11:07:15 cal2 root - /root > rpmg kernel
kernel-smp-2.6.16.46-0.12
kernel-ib-devel-1.3-2.6.16.46_0.12_smp.volt2986
kernel-smp-2.6.16.54-0.2.5
kernel-ib-1.3-2.6.16.46_0.12_smp.volt2986
kernel-source-2.6.16.46-0.12
kernel-source-2.6.16.54-0.2.5
We have successfully compiled Amber 9 using openmpi/1.2.6_gcc-4.1.2
and the Intel Fortran and C++ compilers. The tests ran without
problems; however, when scaling jobs to use 128-256 CPUs we encounter
MPI problems. The error is the following:
The InfiniBand retry count between two MPI processes has been
exceeded. "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):
The total number of times that the sender wishes the receiver to
retry timeout, packet sequence, etc. errors before posting a
completion error.
This error typically means that there is something awry within the
InfiniBand fabric itself. You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.
Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:
Thanks in advance.
----------------------------------------------------------------------------------------------------------------------------------
Dr Geoffrey Wood
Ecole Polytechnique Fédérale de Lausanne
SB - ISIC - LCBC
BCH 4108
CH - 1015 Lausanne
http://lcbcpc21.epfl.ch/Group_members/geoff/
tel: +41 21 693 03 23
e-mail: geoffrey.wood.epfl.ch
----------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" (in the *body* of the email)
to majordomo.scripps.edu
Received on Sun Aug 17 2008 - 06:07:04 PDT