
From: Robert Duke <rduke.email.unc.edu>

Date: Thu, 17 Feb 2005 10:15:33 -0500

Folks - I am posting this to the list because I bothered to write up a long response, and this is clearly a question that bothers folks. Those of us in the business of producing numbers sometimes tend to take the number of significant digits shown in a printout a little too seriously...

Fabien -

This is all perfectly normal behaviour, but I understand how it is that folks get concerned. We would like to think that the same computation produces the same value every time. The problem comes in with what one means by the "same computation". Even running on one processor, making slight changes to the code (i.e., changes that affect only the order of operations, not which operations are performed), using different levels of optimization, or running on different types of cpu will let very slightly different values creep in at the low end of the double precision values - rounding error. Typically, reported values will be consistent out to around 300 steps, but past that point, the values in mdout will begin to diverge. If you do this same test with sander on different types of machines (say linux i386 and sgi) you will see the same effect. Also note that you may see numbers with differences in the last digit before 300 steps, though the differences will often go away, and you will only start diverging later. This is a rounding effect; say internally you had the number 1234.5645499999999, which would be reported externally as 1234.5645 to eight significant digits. If you change something like an energy summation ever so slightly (imagine 100,000 atoms * 150 interactions with division by the sqrt of the distance for each sum term, for instance), you might produce 1234.5645500000000 instead, and that delta in the 17th significant digit can change the reported value (1234.5646 vs. 1234.5645).
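The order dependence is easy to demonstrate outside of any MD code. Here is a minimal sketch in Python (a toy example, not AMBER code): the same list of doubles, summed in two different orders, gives two different answers, because floating point addition is not associative. The terms are chosen to make the effect dramatic rather than realistic.

```python
# Minimal sketch (not AMBER code): floating-point addition is not
# associative, so summing identical terms in a different order can
# change the result -- here dramatically, to make the effect obvious.
terms = [0.1] * 10 + [1e16, -1e16]

small_first = sum(terms)            # the accumulated ~1.0 is absorbed into 1e16
big_first = sum(reversed(terms))    # 1e16 - 1e16 cancels first

print(small_first)                # 0.0  (the ~1.0 was lost to rounding)
print(big_first)                  # 0.9999999999999999
print(small_first == big_first)   # False
```

In an MD energy sum the per-term magnitudes are far less extreme, so the discrepancy shows up only in the last couple of bits rather than wiping out whole terms, but the mechanism is the same.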

The extent to which you see this basic problem varies with machine architecture; intel chips and ibm power3/4 chips, I know, have higher internal precision, and the power3/4 chips have the ability to fuse floating point operations and only lose precision in the stored result. Floating point on computers is actually very nasty, if you like very precise results. The values that are most quickly affected are the virials in constant pressure calcs, simply due to the nature of the calculations done, and if you look at pressures they are reported to low precision and vary. Kinetic energy also suffers. You will only get the exact same results on one processor if you use the same cpu type and the exact same executable, compiled with the exact same options, running on the exact same operating system, itself compiled with the exact same compiler and built against the exact same compiler and system libraries.

Okay, the above comments apply to sander and pmemd. With both of these, another level of indeterminacy occurs when you do parallel runs, because here different processors are doing different portions of the various sums, and then these summed values get reported (basically, the order of doing the force and energy summations varies, with different resulting rounding errors). If you do long runs with either sander or pmemd under mpi, you will notice this indeterminacy, in the sense that runs with different numbers of processors will produce different results. It has to be remembered that the result differences are INSIGNIFICANT. More on that later.
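A toy sketch of that effect in Python (no MPI and no AMBER code involved, just an illustration): split the same set of per-term energies into partial sums the way different processor counts would, combine the partials, and the totals come out agreeing to many digits but not to the last bit.

```python
import random

# Toy model of a parallel energy reduction (not actual pmemd/MPI code):
# partition the same per-term energies across "processors", sum each
# partition locally, then combine the partial sums. Different
# partitionings change the order of additions, and therefore the
# rounding error in the total.
random.seed(7)
energy_terms = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

def parallel_total(terms, nprocs):
    """Sum terms as nprocs contiguous partial sums, then sum the partials."""
    chunk = (len(terms) + nprocs - 1) // nprocs
    partials = [sum(terms[i:i + chunk]) for i in range(0, len(terms), chunk)]
    return sum(partials)

e_2cpu = parallel_total(energy_terms, 2)
e_16cpu = parallel_total(energy_terms, 16)

# The totals agree to roughly 12+ significant digits but typically
# differ in the last few bits -- the size of delta seen in mdout.
print(e_2cpu, e_16cpu, abs(e_2cpu - e_16cpu))
```

Real MPI reductions add a further wrinkle the sketch omits: with load balancing, the partition boundaries themselves shift from run to run, so even the same processor count need not reproduce itself.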

Now, does pmemd show a bit more indeterminacy than sander? For parallel runs, the answer is almost certainly yes, and it is NOT a cause for alarm. The reason is as follows. In order to obtain the highest possible performance, pmemd does a couple of more sophisticated things: 1) it completely reorders the arrangement of atoms used in nonbonded calcs, producing better memory locality in the calculations (more of the calculations occur in the fast cpu cache) and allowing for a more effective parallel processing division of the nonbonded atoms. The reordering depends on where each atom is in space at each list build cycle. This means that the order in which summations are done changes, though the summations themselves do not change. And 2) the pmemd code aggressively load balances the workload, which means that the number of nonbonded atoms handled by each processor is adjusted to ensure that each processor finishes each step at the same time. These factors introduce all kinds of uncertainty into the order of summations, caused by the varying performance over time of both each cpu and the network components (the cpu's are subject to indeterminate workloads from at least the operating system under the best of conditions, and from other applications under the worst; a bigger factor is impacts on net interconnect performance that vary from node to node, impacts from shared memory cache components, etc.).

There is another place where pmemd may produce slightly different results in parallel runs, and this has to do with axis orientation optimization in pmemd. PMEMD will look at box geometry, and if an orthogonal box is significantly longer on one axis, for internal calculations the coordinates will be flipped around to make the longer axis the z axis. This is done because one gets better spatial locality due to the orientation of fft slabs in the reciprocal sum (i.e., in parallel runs, less data sharing is necessary between processors). Now, if you are using shake, you go through an iterative procedure of adjusting coordinates, and I have not bothered to change the order of these adjustments to keep it consistent with axis reorientation. So shake produces slightly different results if the axes are reoriented. Which answer is correct? Both and neither. Shake is an approximation, and neither result is more correct. You can get better results by using a smaller number for the shake tolerance, tol. This effect can be particularly pronounced if you have bad hotspots in your system, because shake will produce significantly different results in those circumstances (bad hotspots, or incorrect ewald parameters, will be evident from ewald error estimate values with exponents of e-02 or larger; typically you want ewald error estimate values down around e-04). For this reason pmemd turns off axis optimization for minimizations, and it is only turned on in parallel runs with orthogonal unit cells with some aspect ratio >= 1.5, I believe. You CAN force it off by using the use_axis_opt = 0 option in &ewald if you have an MD run with hotspots, but it is probably best to get the system well minimized and gradually heated up. If axis optimization is being used, there is a line in mdout that reports that it is turned on.

So I hope that somewhat allays your fears. As some folks say, due to rounding errors we are just examining different phase space with different runs. What you are observing is differences out around the 10th decimal place after literally billions of floating point operations involving +,-,*,/,sqrt,exp,sin,cos,erf, etc., with some approximation algorithms held to certain tolerances thrown in for good measure. The original rounding errors occur due to simple summations and differences, but are quickly magnified in multiplication, division, sqrt, etc. (remember, at each step forces are used to determine the next set of coordinates). If you really want to worry about something, think about the fact that all this calculation is based on a huge pile of empirical parameters that typically are known to somewhere between 3 and 5 significant digits.
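To see how quickly a last-bit delta gets magnified when each step feeds the next, here is a hedged Python sketch using the chaotic logistic map as a stand-in for an MD integrator (this is an analogy, not real dynamics; MD trajectories are similarly chaotic): two starting values differing by about one part in 10^16 disagree in their leading digits within a couple hundred iterations.

```python
# Stand-in for an MD integrator (an analogy, not real dynamics): the
# logistic map x -> r*x*(1-x) is chaotic for r = 3.9, so it magnifies
# tiny input differences exponentially, much as coordinates fed back
# through forces do step after step.
r = 3.9
x = 0.6
y = 0.6 + 2.0 ** -52        # perturb by roughly one part in 10^16

max_delta = 0.0
for step in range(200):
    x = r * x * (1.0 - x)
    y = r * y * (1.0 - y)
    max_delta = max(max_delta, abs(x - y))

# The ~1e-16 initial delta typically grows to order 0.1-1 -- the two
# "trajectories" have become completely different, though each is an
# equally valid solution to within the arithmetic's precision.
print(max_delta)
```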

Regards - Bob Duke

----- Original Message -----

From: "Fabien Cailliez" <Fabien.Cailliez.ibpc.fr>

To: "Robert Duke" <rduke.email.unc.edu>

Sent: Thursday, February 17, 2005 8:53 AM

Subject: pmemd problems

> Dear Dr Duke,
>
> I have remarked something strange in pmemd behavior.
> I am working on 16 processors on SGI machines with mpi.
> I remarked that my runs during my MD were not reproducible.
> From the same starting point (with velocities read in from the rst
> file), I cannot reproduce the same results.
> Here are the results I obtained for 3 runs from the same starting point:
>
> 1)
>  NSTEP =     500  TIME(PS) =   2401.000  TEMP(K) =  299.36  PRESS =    13.1
>  Etot   =  -213517.0647  EKtot   =    51393.6405  EPtot      =  -264910.7052
>  BOND   =     1270.7129  ANGLE   =     3314.5123  DIHED      =     4131.0446
>  1-4 NB =     1539.9896  1-4 EEL =    19701.1888  VDWAALS    =    33279.2817
>  EELEC  =  -328147.4351  EHBOND  =        0.0000  RESTRAINT  =        0.0000
>  EKCMT  =    23535.4332  VIRIAL  =    23296.0162  VOLUME     =   848190.7099
>                                                   Density    =        1.0129
>  Ewald error estimate:   0.5374E-04
> ------------------------------------------------------------------------------
>
> 2)
>  NSTEP =     500  TIME(PS) =   2401.000  TEMP(K) =  299.36  PRESS =    13.1
>  Etot   =  -213517.0654  EKtot   =    51393.6395  EPtot      =  -264910.7049
>  BOND   =     1270.7129  ANGLE   =     3314.5123  DIHED      =     4131.0446
>  1-4 NB =     1539.9896  1-4 EEL =    19701.1889  VDWAALS    =    33279.2825
>  EELEC  =  -328147.4356  EHBOND  =        0.0000  RESTRAINT  =        0.0000
>  EKCMT  =    23535.4316  VIRIAL  =    23296.0149  VOLUME     =   848190.7098
>                                                   Density    =        1.0129
>  Ewald error estimate:   0.5377E-04
> ------------------------------------------------------------------------------
>
> 3)
>  NSTEP =     500  TIME(PS) =   2401.000  TEMP(K) =  299.36  PRESS =    13.1
>  Etot   =  -213517.0647  EKtot   =    51393.6374  EPtot      =  -264910.7020
>  BOND   =     1270.7129  ANGLE   =     3314.5123  DIHED      =     4131.0446
>  1-4 NB =     1539.9896  1-4 EEL =    19701.1889  VDWAALS    =    33279.2823
>  EELEC  =  -328147.4325  EHBOND  =        0.0000  RESTRAINT  =        0.0000
>  EKCMT  =    23535.4327  VIRIAL  =    23295.9982  VOLUME     =   848190.7099
>                                                   Density    =        1.0129
>  Ewald error estimate:   0.5371E-04
> ------------------------------------------------------------------------------
>
> I am really confused with that since I never had this problem with sander
> before....
> Someone in my lab observed the same things (on the same cluster). She just
> showed me one mail that you posted
> (http://amber.scripps.edu/Questions/mail/308.html) that may explain this
> thing but to be honest, I am not sure I understand correctly.
> That is why I write directly to you to have further information.
> Is this behavior "normal"? Or is this problem due to SGI use?
>
> Thanks in advance for your help,
> Fabien
>
> --
> __________________________________________________________________
> Fabien Cailliez                        Tel : 01 58 41 51 63
> Laboratoire de Biochimie Théorique     e-mail : cailliez.ibpc.fr
> IBPC, 13, rue Pierre et Marie Curie, 75005 Paris
> __________________________________________________________________

-----------------------------------------------------------------------

The AMBER Mail Reflector

To post, send mail to amber.scripps.edu

To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu

Received on Thu Feb 17 2005 - 15:53:01 PST
