Hello,
        For a long time I have been struggling to run AMBER6 on Compaq 
ES40 machines. The speed is good; in fact, the best AMBER6 benchmark times 
I have here are on these machines. The problem is that the numbers deviate 
from the expected values.
For example, with one processor on an ES40, 16 processors on an IBM SP3,
24 processors on a Linux cluster, or 4 processors on a Sun E-420R, I get
identical mdinfo files after the benchmark (100 steps). With more than one
processor on the ES40, the numbers are wrong!
To show the seriousness of the problem, here are the energies from a 
correct mdout followed by those from an ES40 with 4 processors:
==========================================================================
+++ Correct +++
 NSTEP =     1  TIME(PS) =  510.051  TEMP(K) =   302.02  PRESS =      0.00
 Etot   =  -57716.6183  EKtot   =   14145.7439  EPtot      =  -71862.3622
 BOND   =     452.1690  ANGLE   =    1277.0334  DIHED      =     968.3542
 1-4 NB =     545.9440  1-4 EEL =    6666.3920  VDWAALS    =    8109.3892
 EELEC  =  -89881.6441  EHBOND  =       0.0000  CONSTRAINT =       0.0000
 Ewald error estimate:   0.3783E-04
+++ ES40 with 4 processors +++
 NSTEP =     1  TIME(PS) =  510.051  TEMP(K) =   302.02  PRESS =      0.00
 Etot   =  -57716.6161  EKtot   =   14145.7461  EPtot      =  -71862.3622
 BOND   =     452.1690  ANGLE   =    1277.0334  DIHED      =     968.3542
 1-4 NB =     545.9440  1-4 EEL =    6666.3920  VDWAALS    =    8109.3892
 EELEC  =  -89881.6441  EHBOND  =       0.0000  CONSTRAINT =       0.0000
 Ewald error estimate:   0.3783E-04
==========================================================================
Etot differs at the second decimal place by the first step!
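(As an aside, differences in the low-order digits can arise simply from the
order in which floating-point numbers are added, since floating-point
addition is not associative. A minimal Python illustration of my own, not
anything from the AMBER code:

```python
# Floating-point addition is not associative: the same three
# values summed with different grouping give different results.
a = (0.1 + 0.2) + 0.3   # 0.6000000000000001
b = 0.1 + (0.2 + 0.3)   # 0.6
print(a == b)           # False
print(abs(a - b))       # one ulp here; errors can grow with many terms
```

Whether this accounts for a difference as large as 0.0022 in EKtot, I cannot
say.)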
For a long time I suspected something wrong with the Compaq 
DXML/Fortran/MPI libraries. 
Finally, I decided to pinpoint the problem and tried MPICH instead of 
Compaq MPI. MPICH passed all its tests, showing it was installed 
correctly, but still gave the same numbers. So Compaq MPI is (probably) 
not the problem.
Then I excluded the Compaq mathematical libraries (DXML) and used the ones 
that come with AMBER. That also did not solve the problem, so Compaq DXML 
is not the culprit either.
I played around with as many Fortran compilers and compiler options as I 
could, but it made no difference.
Putting all this in perspective, I reluctantly concluded that the problem 
may lie with AMBER6 as much as with Compaq.
Can anyone help me sort out this problem? I tried looking into the code, 
but I couldn't follow much, as I am not acquainted with MPI programming. 
Here are some more clues and notes:
NOTES: 
1. The problem appears to be CONFINED to the calculation of EKtot alone 
(dEKtot = 0.0022), which surprises me. (The temperatures, of course, 
follow suit. This was verified even from the mden files at the end of 
the first step.)
2. The system tested was DHFR (the benchmark), part of the AMBER test suite.
3. The numbers change with the number of processors. Even if I spawn 
multiple threads on a single processor, I have the same problem. So this 
is more indicative of an MPI problem on the Compaq than of an AMBER problem.
4. Since the same numbers are reproduced on ES40s elsewhere as well, it is 
not an installation problem.
5. Since DS20s do not show the problem mentioned above, the Compaq MPI is 
probably still suspect.
6. Discrepancies of this magnitude at every step would indicate a very 
serious problem. If the overall parameters are statistically similar, it 
might just reflect the robustness of the energy landscape and luck, 
nothing more. 
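(On note 3: one possible mechanism, which I can only guess at, is that a 
parallel reduction partitions the sum of per-atom kinetic energies 
differently for each processor count, and the rounding depends on that 
grouping. A hypothetical Python sketch, not AMBER's actual MPI code, using 
a deliberately exaggerated input:

```python
def chunked_sum(values, nproc):
    """Sum 'values' the way a parallel reduction might: each of
    'nproc' ranks sums a contiguous slice, then the partial sums
    are combined.  The grouping changes the rounding."""
    n = len(values)
    partials = []
    for p in range(nproc):
        lo, hi = p * n // nproc, (p + 1) * n // nproc
        s = 0.0
        for v in values[lo:hi]:
            s += v
        partials.append(s)
    total = 0.0
    for s in partials:
        total += s
    return total

# Exaggerated example: the exact sum is 2.0, yet the computed
# result depends on how the work is split across "ranks".
vals = [1e16, 1.0, -1e16, 1.0]
print(chunked_sum(vals, 1))  # 1.0
print(chunked_sum(vals, 2))  # 0.0
```

If something like this is happening, it would explain why the discrepancy 
varies with the processor count but reproduces exactly on ES40s elsewhere.)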
Thanks for any help,
Sincerely,
-Sanjeev
Received on Tue Mar 19 2002 - 23:01:17 PST