
From: Robert Duke <rduke.email.unc.edu>

Date: Mon, 17 Dec 2007 10:29:15 -0500

I find it interesting that this question keeps coming up. It worries me, though, that folks have not thought the points through: we have stated many times that there is an inherent indeterminacy in longer (> 500 step) parallel MD runs that, while annoying, IS OF NO SIGNIFICANCE WHATSOEVER. The archive article linked below does not get the explanation completely correct, and leads off by claiming the Amber MD code is single precision - patently incorrect, I assure you. The code is ALL double precision, and double precision carries between 16 and 17 decimal digits (the Intel and IBM chips actually have an internal 80-bit representation that, as long as you stay in fp registers, increases precision by something like a factor of a thousand - off the top of my head). So every fp arithmetic operation in Amber is off by roughly 1 part in 10**16 to 10**17. This is fairly high precision. Now, we do take liberties in the direct force calcs: we use spline tables to calculate direct forces and energies, which lowers the precision of those operations to as low as 11 decimal digits. But that's as bad as it gets in terms of easily defined losses of precision (there are lots of other things in the numerical methods that will cause the number to be "not quite right"; heck, every transcendental function you use has some of this). Okay, but back to the parallel problem.
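You can check those digit counts directly in any IEEE-754 environment. This little Python snippet (purely an illustration, nothing to do with pmemd itself) shows the roughly 16 decimal digits of a double and the per-operation rounding threshold:

```python
import sys

# IEEE-754 double precision: 53-bit significand, about 15.95 decimal digits.
eps = sys.float_info.epsilon
print(sys.float_info.dig)        # 15 decimal digits always round-trip
print(eps)                       # 2.220446049250313e-16

# Per-operation relative rounding error is bounded by eps/2, i.e.
# roughly 1 part in 10**16.
print(1.0 + eps == 1.0)          # False: eps is the gap to the next double above 1.0
print(1.0 + eps / 4 == 1.0)      # True: below the rounding threshold, it vanishes
```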

A parallel system, correctly implemented, ensures correct results in the sense that the correct operations will be done regardless of how the problem is parallelized. However, a given ordering of equivalent operations is not guaranteed; e.g., a + b + c may be executed as (a + b) + c one time and as a + (b + c) another. Same result, right? NO, not in a floating point number system, where at the limit of precision you may be off by one least significant digit for equivalent operations done in different orders.

Now, each step of dynamics on a 100K-atom system takes about 1 second on a 3 GHz Intel chip. Roughly speaking, that is 3 billion math operations (maybe really half that, as there is a tremendous amount of integer bookkeeping and memory movement, especially in parallel code). But call it roughly 3 billion operations per step, and pmemd results are typically reproducible in parallel runs for between 300 and 500 steps. So let's rough that out: 3*10**9 operations/step times ~333 steps is approximately 10**12 math operations before there is a noticeable loss of precision in the 4th decimal place. It all boils down to: 1) the tremendous number of calculations done; 2) rounding error in the last decimal place (16th-17th) that varies depending on the ordering of equivalent operations; and 3) network indeterminacy in the order of addition of forces, energies, and virials (primarily).

What do I mean by network indeterminacy? There is no fixed order of i/o completion in a distributed asynchronous operation. If one node stalls a little due to a nondeterministic event (say a service timer pops and the OS spends 10,000 cycles checking on something, or the interconnect services a communications request less rapidly due to loading by jobs on other nodes), then the order of equivalent operations may vary, and the result of, say, an addition of 3 numbers from 3 different nodes may differ by one least significant digit in different runs.

Where else does grief arise? Well, compile pmemd using FFTW for your FFTs, and run a little experiment with the uniprocessor code. Do 10 runs at different times on your machine, all about 300 steps on something like factor IX. Look at the results; they MAY be slightly different. Why? FFTW runs short timing tests to pick the best of several different algorithms for a given run. The algorithm it chooses may differ from run to run, because for some there is not much difference, so depending on what the various services on your workstation are doing, sometimes the truly optimal algorithm loses and another is chosen.
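The (a + b) + c vs. a + (b + c) point above is easy to demonstrate. This Python sketch (an illustration only, not anything from the Amber codebase) also mimics what reordered per-node contributions do to a summed total:

```python
import random

# Different groupings of the "same" sum differ in the last bits.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                  # 0.6000000000000001
print(a + (b + c))                  # 0.6
print((a + b) + c == a + (b + c))   # False

# Summing identical contributions in a different order - as nodes
# finishing in a different order would - perturbs the least
# significant digits of the total.
random.seed(0)
vals = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
s_orig = sum(vals)
s_sorted = sum(sorted(vals))
print(abs(s_orig - s_sorted))       # tiny, but usually nonzero
```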

Now, fast Fourier transforms are numerical methods, and most numerical methods are even more imprecise than the number system representation (otherwise you are really wasting time...), so the different algorithms get slightly different results. And I have not even mentioned two other factors: 1) different machines, and 2) my use of at least a dozen different codepaths (basically equivalent algorithms, but once again the order of equivalent operations may vary) depending on what is optimal for a given machine.
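To see that a transform round-trip is only correct to numerical precision rather than bit-exact, you don't even need FFTW; a naive pure-Python DFT (a toy stand-in, not pmemd code) shows the same effect:

```python
import cmath

def dft(x, sign=-1):
    # Naive O(n^2) discrete Fourier transform; sign=+1 gives the
    # inverse kernel (caller scales the result by 1/n).
    n = len(x)
    return [sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
                for j in range(n))
            for k in range(n)]

x = [(i * 37 % 17) / 17.0 for i in range(64)]     # deterministic test signal
fwd = dft(x)
back = [v / len(x) for v in dft(fwd, sign=+1)]
err = max(abs(b.real - a) for a, b in zip(x, back))
print(err)   # small but nonzero: correct only to roughly machine precision
```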

All these errors are considered insignificant because the real accuracy of the results you are producing is way below the precision errors here. Absolute reproducibility is essential for developing code reliably, and I have to be sure that all significant codepaths get exercised within the "window of reproducibility" I have to work with, but the significance of the precision-related errors in terms of the results is nil. The general handwaving is to simply say we are "sampling different regions of phase space", and we are. Now, if you really want to worry about things that significantly affect whether your results are "good", then I would recommend worrying about: 1) the quality of parameterization of the force field - all current force fields are works in progress (and take a gander at the precision of the empirical constants in the ff files); 2) step integration issues, like timestep size, shake parameters, etc.; and 3) sampling statistics - did you really run long enough to know that the results you are seeing are representative of the system? How do you know? (I leave it to the folks who specialize in those areas to answer these sorts of questions ;-)).

Best Regards - Bob Duke

PS - the best article, in my opinion, on floating point issues is:

David Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic", ACM Computing Surveys 23 (1991), pp. 5-48.

----- Original Message -----

From: "Benjamin Juhl" <itbbju.itb.uni-stuttgart.de>

To: <amber.scripps.edu>

Sent: Monday, December 17, 2007 5:25 AM

Subject: Re: AMBER: pmemd: same rst, different result?

> Hi Anselm,
>
> we observed the same thing recently and found that the explanations
> from the following mail in the amber mailing list archive cover about
> all your questions:
> http://amber.ch.ic.ac.uk/archive/200401/0103.html
>
> Benjamin
>
> Anselm Horn schrieb:
>> Dear all,
>>
>> we encountered a strange behaviour during a pmemd restart on our cluster
>> (infiniband network, job ran on 8 nodes with 4 processors each):
>>
>> A restart using a certain rst file crashed because there were numbers
>> too large for the output format and thus only asterisks appeared (known
>> problem). When we reran the MD calculation starting from the previous
>> rst file that also served as input for the trajectory that produced the
>> corrupted rst file, pmemd then produced a valid rst file at the end of
>> this new trajectory. Thus, the same rst file on the same (homogeneous)
>> cluster with the same executable yielded a different result. How can
>> this be explained?
>>
>> Is it possible that the parallel implementation is the reason for this
>> non-deterministic behaviour - or does this simply indicate a potential
>> hardware problem on some of the machines?
>>
>> Any hints are welcome.
>>
>> Regards,
>>
>> Anselm Horn
>> Bioinformatik
>> Emil-Fischer-Zentrum
>> Friedrich-Alexander-Universität Erlangen-Nürnberg

-----------------------------------------------------------------------

The AMBER Mail Reflector

To post, send mail to amber.scripps.edu

To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu

Received on Wed Dec 19 2007 - 06:07:18 PST
