Hello,
For a long time I have been struggling to run AMBER6 on ES40 (Compaq)
machines. I could get good speed on these machines, in fact the best I have
obtained here for the amber6 benchmark. The problem is that the numbers
deviate from the expected values.
For example, running the benchmark (100 steps) with one processor on the
ES40s, 16 processors on an IBM SP3, 24 processors on Linux clusters, or 4
processors on a Sun E-420R, I got identical mdinfo files. With more than
one processor on the ES40, however, the numbers are wrong!
To show the seriousness of the problem, here are the energies from a
correct mdout followed by those from an ES40 with 4 processors:
==========================================================================
+++ Correct +++
NSTEP = 1 TIME(PS) = 510.051 TEMP(K) = 302.02 PRESS = 0.00
Etot = -57716.6183 EKtot = 14145.7439 EPtot = -71862.3622
BOND = 452.1690 ANGLE = 1277.0334 DIHED = 968.3542
1-4 NB = 545.9440 1-4 EEL = 6666.3920 VDWAALS = 8109.3892
EELEC = -89881.6441 EHBOND = 0.0000 CONSTRAINT = 0.0000
Ewald error estimate: 0.3783E-04
+++ ES40 with 4 processors +++
NSTEP = 1 TIME(PS) = 510.051 TEMP(K) = 302.02 PRESS = 0.00
Etot = -57716.6161 EKtot = 14145.7461 EPtot = -71862.3622
BOND = 452.1690 ANGLE = 1277.0334 DIHED = 968.3542
1-4 NB = 545.9440 1-4 EEL = 6666.3920 VDWAALS = 8109.3892
EELEC = -89881.6441 EHBOND = 0.0000 CONSTRAINT = 0.0000
Ewald error estimate: 0.3783E-04
==========================================================================
Etot already differs in the second decimal place at the first step!
For a long time I suspected something was wrong with the Compaq
dxml/fortran/mpi libraries.
Finally, I decided to pinpoint the problem and tried MPICH instead of
Compaq MPI. MPICH passed all its tests, showing that it was installed
correctly, but it still gave the same numbers. So Compaq MPI is probably
not the problem.
Then I excluded the Compaq mathematical libraries (dxml) and used the ones
that come with amber. That did not solve the problem either, so Compaq
DXML is also not the culprit.
I also experimented with several Fortran compilers and their options, but
it made no difference.
Putting all this in perspective, I have reluctantly concluded that the
problem is just as likely to be in amber6 as in the Compaq software.
Can anyone help me sort out the problem? I tried looking into the code,
but I could not follow much of it, as I am not acquainted with MPI
programming. Here are some more clues and notes:
NOTES:
1. The problem appears to be CONFINED (dEKtot = 0.0022) to the calculation
of EKtot alone, which surprises me. (The temperatures, of course, follow
suit. This was verified from the mden files as well, at the end of the
1st step.)
2. System tested was DHFR (benchmark), a part of amber test suite.
3. The numbers change with the number of processors. Even if I spawn
multiple threads on a single processor, I have the same problem. This is
more indicative of an MPI problem on the Compaq than of one in amber.
4. Since the same numbers are reproduced on ES40s elsewhere as well, it is
not an installation problem.
5. Since the DS20s do not show the problem mentioned above, the Compaq MPI
libraries probably remain suspect.
6. Discrepancies of this order at each step would indicate a very serious
problem. If the overall parameters remain statistically similar, it may
just be due to the robustness of the energy landscape and luck, nothing
more.
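One possible benign explanation for notes 1 and 3 (a guess on my part, not
something I have confirmed in the amber6 source): EKtot is a global sum
over per-atom kinetic-energy terms, and a parallel reduction (such as
MPI_Allreduce) partitions and orders that sum differently for each
processor count. Since floating-point addition is not associative, the
reduced value can legitimately differ in the last few digits. The sketch
below (hypothetical, not amber6 code) mimics this by summing the same
terms with different "processor" groupings:

```python
# Hypothetical sketch: floating-point addition is not associative, so
# summing identical terms with different groupings (as a parallel
# reduction over 2, 4, 8 processors would) gives slightly different
# totals. This is one possible source of a small EKtot discrepancy.
import random

random.seed(7)
# Stand-ins for per-atom kinetic-energy contributions.
terms = [random.uniform(0.0, 1.0) * 1e4 for _ in range(100000)]

def chunked_sum(values, nprocs):
    """Sum per-'processor' partial sums, mimicking a reduction tree."""
    size = (len(values) + nprocs - 1) // nprocs
    partials = [sum(values[i:i + size])
                for i in range(0, len(values), size)]
    return sum(partials)

serial = sum(terms)
for n in (2, 4, 8):
    # Differences are tiny but generally nonzero, and they vary with n.
    print(n, chunked_sum(terms, n) - serial)
```

If this is what is happening, the differences should stay at round-off
level on step 1 and only grow through chaotic trajectory divergence, not
through any systematic error; a genuine bug would show a much larger or
reproducibly biased dEKtot.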
Thanks for any help,
Sincerely,
-Sanjeev
Received on Tue Mar 19 2002 - 23:01:17 PST