
From: Sanjeev B.S. <sanjeev_at_mbu.iisc.ernet.in>

Date: Wed 20 Mar 2002 12:31:17 +0530 (IST)

Hello,

For a long time I have been struggling to run AMBER6 on ES40 (Compaq)
machines. I get good speed on these, the best I could get here for the
amber6 benchmark. The problem is that the numbers deviate from the
expected values.

For example, after the benchmark (100 steps) I got identical mdinfo files
from one processor on an ES40, 16 processors on an IBM SP3, 24 processors
on Linux clusters, and 4 processors on a Sun E-420R. With more than one
processor on an ES40, the numbers are wrong!

To show the seriousness of the problem, here are the energies from a
correct mdout, followed by those from an ES40 with 4 processors:

==========================================================================
+++ Correct +++
 NSTEP = 1   TIME(PS) = 510.051   TEMP(K) = 302.02   PRESS = 0.00
 Etot   =  -57716.6183  EKtot   =  14145.7439  EPtot      = -71862.3622
 BOND   =     452.1690  ANGLE   =   1277.0334  DIHED      =    968.3542
 1-4 NB =     545.9440  1-4 EEL =   6666.3920  VDWAALS    =   8109.3892
 EELEC  =  -89881.6441  EHBOND  =      0.0000  CONSTRAINT =      0.0000
 Ewald error estimate: 0.3783E-04

+++ ES40 with 4 processors +++
 NSTEP = 1   TIME(PS) = 510.051   TEMP(K) = 302.02   PRESS = 0.00
 Etot   =  -57716.6161  EKtot   =  14145.7461  EPtot      = -71862.3622
 BOND   =     452.1690  ANGLE   =   1277.0334  DIHED      =    968.3542
 1-4 NB =     545.9440  1-4 EEL =   6666.3920  VDWAALS    =   8109.3892
 EELEC  =  -89881.6441  EHBOND  =      0.0000  CONSTRAINT =      0.0000
 Ewald error estimate: 0.3783E-04
==========================================================================

Etot already differs in the third decimal place (by 0.0022) at the 1st step!
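The two energy records quoted above can be compared field by field rather than by eye. The following is a minimal sketch, not part of AMBER: the helper names parse_energies and diff_fields are made up for illustration, and the regular expression only assumes the "LABEL = value" layout seen in the excerpts.

```python
import re

# Energy records copied from the two mdout excerpts quoted in this post.
CORRECT = """
NSTEP = 1 TIME(PS) = 510.051 TEMP(K) = 302.02 PRESS = 0.00
Etot = -57716.6183 EKtot = 14145.7439 EPtot = -71862.3622
BOND = 452.1690 ANGLE = 1277.0334 DIHED = 968.3542
1-4 NB = 545.9440 1-4 EEL = 6666.3920 VDWAALS = 8109.3892
EELEC = -89881.6441 EHBOND = 0.0000 CONSTRAINT = 0.0000
"""

ES40_4CPU = """
NSTEP = 1 TIME(PS) = 510.051 TEMP(K) = 302.02 PRESS = 0.00
Etot = -57716.6161 EKtot = 14145.7461 EPtot = -71862.3622
BOND = 452.1690 ANGLE = 1277.0334 DIHED = 968.3542
1-4 NB = 545.9440 1-4 EEL = 6666.3920 VDWAALS = 8109.3892
EELEC = -89881.6441 EHBOND = 0.0000 CONSTRAINT = 0.0000
"""

# "LABEL = number" pairs; labels may contain spaces, digits, '-' and
# parentheses (e.g. "1-4 NB", "TIME(PS)").
FIELD = re.compile(r"([A-Za-z0-9][A-Za-z0-9()\- ]*?)\s*=\s*(-?\d+(?:\.\d+)?)")


def parse_energies(text):
    """Return {label: value} for every 'LABEL = number' pair in text."""
    return {label.strip(): float(value) for label, value in FIELD.findall(text)}


def diff_fields(ref, test, tol=5e-5):
    """Return {label: test - ref} for fields differing by more than tol.

    tol is half a unit of the last printed digit (4 decimals).
    """
    return {
        key: round(test[key] - ref[key], 4)
        for key in ref
        if abs(test[key] - ref[key]) > tol
    }


ref = parse_energies(CORRECT)
es40 = parse_energies(ES40_4CPU)
delta = diff_fields(ref, es40)
print("fields that differ:", delta)

# Consistency check: Etot should equal EKtot + EPtot in each record.
for name, rec in (("correct", ref), ("ES40 x4", es40)):
    print(name, "Etot - (EKtot + EPtot) =",
          round(rec["Etot"] - (rec["EKtot"] + rec["EPtot"]), 4))
```

On the values quoted here, the only fields that differ are Etot and EKtot, each by 0.0022, and Etot = EKtot + EPtot holds in both records, so the shift in Etot is just the shift in EKtot carried through.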

For a long time I suspected that something was wrong with the Compaq
DXML/Fortran/MPI libraries.

Finally, I decided to pinpoint the problem and tried MPICH instead of
Compaq MPI. MPICH passed all the tests, showing that things were fine, but
it still gave the same numbers as Compaq MPI. So Compaq MPI is (probably)
not the problem.

Then I left out the Compaq mathematical libraries and DXML and used the
routines that come with amber. That did not solve the problem either, so
Compaq DXML is not the culprit.

I tried as many Fortran compilers and compiler options as I could to make
this work, but it made no difference.

Putting all this together, I reluctantly concluded that the problem is
just as likely to lie in amber6 as in the Compaq software.

Can anyone help me sort out this problem? I tried looking into the code,
but I couldn't follow much of it as I am not acquainted with MPI
programming. Here are some more clues and notes:

NOTE:

1. The problem appears to be CONFINED to the calculation of EKtot alone
(dEKtot=0.0022), which surprises me. (The temperatures follow suit, of
course. I verified this from the mden files at the end of the 1st step
as well.)

2. The system tested was DHFR (benchmark), a part of the amber test suite.

3. The numbers change with the number of processors. Even if I spawn more
threads on a single processor, I have the same problem. So this points to
an MPI problem on Compaq more than to amber.

4. Since the same numbers are reproduced on ES40s elsewhere as well, it is
not a problem with the installation.

5. Since DS20s do not show the problem mentioned above, Compaq MPI is
probably still suspect.

6. Discrepancies of this order at each step would mean very serious
problems. If the overall parameters still come out statistically similar,
that may well be due to the robustness of the energy landscape and luck,
and nothing more.
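As background to note 3, one generic way a summed quantity such as EKtot can depend on the processor count is that each processor adds up its own share and the partial sums are then combined, which changes the order of the floating-point additions. The sketch below only illustrates that effect in double precision. It is not a diagnosis of the ES40 problem, it does not use AMBER or MPI, and the masses, velocities, atom count and function names are all made up for illustration.

```python
import random


def kinetic_terms(n_atoms, seed=0):
    """Made-up per-atom terms 0.5 * m * v**2 (illustrative values only)."""
    rng = random.Random(seed)
    return [0.5 * rng.uniform(1.0, 16.0) * rng.gauss(0.0, 1.0) ** 2
            for _ in range(n_atoms)]


def naive_sum(values):
    """Plain left-to-right accumulation, as a simple loop would do it."""
    total = 0.0
    for v in values:
        total += v
    return total


def partitioned_sum(values, n_ranks):
    """Sum contiguous chunks separately, then add the partial sums.

    This mimics each of n_ranks processes summing its own atoms before
    the partial results are combined.
    """
    chunk = -(-len(values) // n_ranks)  # ceiling division
    partials = [naive_sum(values[i:i + chunk])
                for i in range(0, len(values), chunk)]
    return naive_sum(partials)


terms = kinetic_terms(20000)  # hypothetical atom count
totals = {n: partitioned_sum(terms, n) for n in (1, 2, 4, 8)}
for n, total in totals.items():
    print(f"{n} rank(s): {total!r}  (diff vs 1 rank: {total - totals[1]:.3e})")
```

In double precision any differences this produces are typically far smaller than the 0.0022 reported above, so reordering alone may well not account for the ES40 numbers; the sketch is only meant to show why a result can change with the number of processors at all.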

Thanks for any help,

Sincerely,

-Sanjeev

Received on Tue Mar 19 2002 - 23:01:17 PST
