Folks - I am posting this to the list because I took the time to write up a
long response, and this is clearly a question that concerns a lot of people.
Those of us in the business of producing numbers sometimes tend to take the
number of significant digits shown in a printout a little too seriously...
Fabien -
This is all perfectly normal behaviour, but I understand how it is that
folks get concerned. We would like to think that the same computation
produces the same value every time. The problem comes in with what one
means by the "same computation". Even running on one processor, making
slight changes to the code (i.e., changes that only affect the order of
operations, not which operations are performed), using different levels of
optimization, or running on a different type of cpu will show very
slightly different values creeping in at the low end of the double precision
values - rounding error. Typically, reported values will be consistent out
to around 300 steps, but past that point, the values in mdout will begin to
diverge. If you do this same test with sander on different types of
machines (say linux i386 and sgi) you will see the same effect. Also note
that you may see numbers with differences in the last digit before 300
steps, though the differences will often go away, and you will only start
diverging later. This is a rounding effect; say internally you had the
number 1234.5645499999999, which would be reported externally as 1234.5645,
eight significant digits. If you then change something like an energy
summation ever so slightly (imagine 100,000 atoms * 150 interactions with
division by the sqrt of the distance for each sum term, for instance), you
might instead produce 1234.5645500000000, and a delta way down around the
17th significant digit shows up as a change in the last printed digit.
The extent to which you see this basic problem varies depending on machine
architecture; intel chips and ibm power3/4 chips, I know, carry higher
internal precision, and the power3/4 chips can fuse floating point
operations so that precision is only lost in the stored result. Floating
point on computers is actually a very nasty business if you want very
precise results.
The values that are most quickly affected are the virials in constant
pressure calcs, simply due to the nature of the calculations done; if you
look at the pressures, they are reported to low precision and vary. Kinetic
energy also suffers. You will only get the exact same results on one
processor if you use the same cpu type and the exact same executable,
compiled with the exact same options, running on the exact same operating
system, itself built with the exact same compiler and system libraries.
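
If you want to see this effect without running any MD at all, here is a
little Python sketch (plain Python, nothing to do with the Amber code
itself) that just adds up the same hundred thousand terms in three
different orders; double precision addition is not associative, so the
totals differ in the last digits, which is all that is going on in mdout:

    import random

    random.seed(0)
    # 100,000 energy-like terms spread over a few orders of magnitude
    terms = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-3, 3)
             for _ in range(100000)]

    shuffled = terms[:]
    random.shuffle(shuffled)

    print("%.17g" % sum(terms))            # one summation order
    print("%.17g" % sum(reversed(terms)))  # same terms, reverse order
    print("%.17g" % sum(shuffled))         # same terms, arbitrary order
    # The three totals agree to roughly 12-15 significant digits and then
    # differ - the same terms, just rounded at different points.
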
Okay, the above comments apply to sander and pmemd. With both of these
another level of indeterminacy occurs when you do parallel runs, because
here different processors are doing different portions of the various sums,
and then these summed values get reported (basically, the order of doing the
force and energy summations varies, with different resulting rounding
errors). If you do long runs with either sander or pmemd under mpi, you
will notice this indeterminacy, in the sense that runs with different
numbers of processors will produce different results. It has to be
remembered that the result differences are INSIGNIFICANT. More on that
later.
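
Again, a toy Python sketch (standing in for what an mpi reduction does;
none of this is pmemd code) shows why the processor count matters: the same
terms get grouped into different partial sums before the final reduction,
and each grouping rounds differently:

    import random

    random.seed(1)
    forces = [random.uniform(-1.0, 1.0) for _ in range(120000)]

    def parallel_style_sum(values, nprocs):
        # sum a chunk per "processor" first, then reduce the partial sums
        chunk = len(values) // nprocs
        partials = [sum(values[i * chunk:(i + 1) * chunk])
                    for i in range(nprocs)]
        partials.append(sum(values[nprocs * chunk:]))  # any leftover terms
        return sum(partials)

    for nprocs in (1, 2, 4, 8, 16):
        print(nprocs, "%.17g" % parallel_style_sum(forces, nprocs))
    # Every answer is equally "correct"; they differ only in the last bits
    # because the rounding happens at different points in the reduction.
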
Now, does pmemd show a bit more indeterminacy than sander? For parallel
runs, the answer is almost certainly yes, and it is NOT a cause for alarm.
The reason is as follows. In order to obtain the highest possible
performance, pmemd does a couple of more sophisticated things: 1) it
completely reorders the arrangement of atoms used in nonbonded calcs,
producing better memory locality in the calculations (more of the
calculations occur in the fast cpu cache), and allowing for a more effective
parallel processing division of the nonbonded atoms. The reordering is
dependent on where each atom is in space at each list build cycle. This
means that the order in which summations are done changes, though the
summations done do not change; and 2) the pmemd code aggressively load
balances the workload, which means that the number of nonbonded atoms
handled by each processor is adjusted to ensure that each processor
finishes each step at about the same time. These factors introduce
all kinds of uncertainty into the order of summations, because the
performance of each cpu and of the network components varies over time (the
cpus are subject to indeterminate workloads from at least the operating
system under the best of conditions, and from other applications under the
worst; a bigger factor is the impact on interconnect performance, which
varies from node to node, along with effects from shared memory and cache
components, etc.).
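
The load balancing part is the piece that surprises people most, because it
makes two runs on the SAME number of processors differ. Here is a hedged
little sketch of what moving the division points around does to a long sum
(the "timing jitter" below is faked with a random number generator; the
real thing responds to actual cpu and interconnect timings):

    import random

    random.seed(2)
    pair_energies = [random.uniform(-0.01, 0.01) for _ in range(200000)]

    def run_once(values, nprocs, jitter_seed):
        rng = random.Random(jitter_seed)
        # nominal equal split, then nudge each internal boundary a little,
        # the way a load balancer shifts work between processors
        bounds = [round(i * len(values) / nprocs) for i in range(nprocs + 1)]
        bounds = ([0] +
                  [b + rng.randint(-50, 50) for b in bounds[1:-1]] +
                  [len(values)])
        partials = [sum(values[bounds[i]:bounds[i + 1]])
                    for i in range(nprocs)]
        return sum(partials)

    for run in range(3):
        print("run", run, "%.17g" % run_once(pair_energies, 16, run))
    # Three "runs" of the identical calculation; only the split points
    # moved, and the totals differ in the last digits, just like the
    # mdout energies.
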
There is another place where pmemd may produce slightly different results in
parallel runs, and this has to do with axis orientation optimization in
pmemd. PMEMD will look at box geometry, and if an orthogonal box is
significantly longer on one axis, for internal calculations the coordinates
will be flipped around to make the longer axis the z axis. This is done
because one gets better spatial locality due to the orientation of fft slabs
in the reciprocal sum if one does this (i.e., in parallel runs, less data
sharing is necessary between processors). Now, if you are using shake, you
go through an iterative procedure of adjusting coordinates, and I have not
bothered to change the order of these adjustments to keep it consistent with
axis reorientation. So shake produces slightly different results if the
axes are reoriented. Which answer is correct? Both and neither. Shake is
an approximation, and neither result is more correct. You can get better
results by using a smaller number for the shake tolerance, tol. This effect
can be particularly pronounced if you have bad hotspots in your system,
because shake will produce significantly different results in those
circumstances (bad hotspots or incorrect ewald parameters will be evident
from ewald error estimate values with exponents of e-02 or larger;
typically you want ewald error estimate values down around e-04). For this
reason pmemd turns off axis optimization for minimizations, and it is only turned
on in parallel runs with orthogonal unit cells with some aspect ratio >=
1.5, I believe. You CAN force it off by using the use_axis_opt = 0 option
in &ewald if you have an MD run with hotspots, but it is probably best to
get the system well minimized and gradually heated up. If axis optimization
is being used, there is a line in mdout that reports that it is turned on.
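
On the shake/tolerance point, here is a toy constraint iteration in Python
(only a cartoon of the idea, not the actual shake code in pmemd): satisfy
two coupled bond constraints iteratively, stop when the worst violation is
below tol, and notice that processing the bonds in a different order (which
is in effect what reorienting the axes does to the adjustment order)
changes the converged coordinates by roughly tol; tighten tol and the
spread shrinks:

    import math

    def constrain(positions, bonds, lengths, tol):
        # Cartoon of a SHAKE-style iteration: corrections are applied along
        # fixed reference bond vectors until the worst violation is < tol.
        ref = [list(p) for p in positions]
        pos = [list(p) for p in positions]
        for _ in range(1000):
            worst = 0.0
            for (i, j), d0 in zip(bonds, lengths):
                rij = [pos[i][k] - pos[j][k] for k in range(3)]
                sij = [ref[i][k] - ref[j][k] for k in range(3)]
                viol = sum(c * c for c in rij) - d0 * d0
                worst = max(worst, abs(viol) / (d0 * d0))
                g = viol / (4.0 * sum(rij[k] * sij[k] for k in range(3)))
                for k in range(3):
                    pos[i][k] -= g * sij[k]
                    pos[j][k] += g * sij[k]
            if worst < tol:
                break
        return pos

    coords = [(0.0, 0.0, 0.0), (1.05, 0.1, 0.0), (2.2, -0.1, 0.0)]
    for tol in (1e-5, 1e-10):
        a = constrain(coords, [(0, 1), (1, 2)], [1.0, 1.0], tol)
        b = constrain(coords, [(1, 2), (0, 1)], [1.0, 1.0], tol)
        diff = max(abs(a[i][k] - b[i][k])
                   for i in range(3) for k in range(3))
        print("tol = %g   max coordinate difference = %.3e" % (tol, diff))
    # Both orderings satisfy the constraints to within tol; neither is more
    # correct, and tightening tol pulls the two answers together.
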
So I hope that somewhat allays your fears. As some folks say, due to
rounding errors we are just examining different phase space with different
runs. What you are observing is differences out around the 10th decimal
place after literally billions of floating point operations involving
+,-,*,/,sqrt,exp,sin,cos,erf, etc., with some approximation algorithms held
to certain tolerances thrown in for good measure. The original rounding
errors occur due to simple summations and differences, but are quickly
magnified in multiplication, division, sqrt, etc. (remember, at each step
forces are used to determine the next set of coordinates). If you really
want to worry about something, think about the fact that all this
calculation is based on a huge pile of empirical parameters that typically
are known to somewhere between 3 and 5 significant digits.
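
And on the "different phase space" remark, a last hedged illustration,
using the textbook Lorenz equations rather than a force field, because they
show in a dozen lines how a perturbation down at the 1e-15 level gets
amplified until the two trajectories have nothing to do with each other,
even though both are equally valid:

    def lorenz_step(p, dt=0.005, s=10.0, r=28.0, b=8.0 / 3.0):
        # one forward-Euler step of the Lorenz system
        x, y, z = p
        return (x + s * (y - x) * dt,
                y + (x * (r - z) - y) * dt,
                z + (x * y - b * z) * dt)

    pa = (1.0, 1.0, 20.0)
    pb = (1.0 + 1e-15, 1.0, 20.0)   # perturb one coordinate in the last bit
    for n in range(1, 12001):
        pa, pb = lorenz_step(pa), lorenz_step(pb)
        if n % 2000 == 0:
            print(n, "%.3e" % abs(pa[0] - pb[0]))
    # The difference starts around 1e-15 and grows until it is as large as
    # the coordinates themselves; the two trajectories end up sampling
    # different parts of the same attractor, the analogue of two MD runs
    # that match at step 300 and look unrelated much later.
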
Regards - Bob Duke
----- Original Message -----
From: "Fabien Cailliez" <Fabien.Cailliez.ibpc.fr>
To: "Robert Duke" <rduke.email.unc.edu>
Sent: Thursday, February 17, 2005 8:53 AM
Subject: pmemd problems
> Dear Dr Duke,
>
> I have noticed something strange in pmemd's behavior.
> I am working on 16 processors on SGI machines with mpi.
> I noticed that my MD runs were not reproducible.
> From the same starting point (with velocities read in from the rst
> file), I cannot reproduce the same results.
> Here are the results I obtained for 3 runs from the same starting point:
> 1)
>  NSTEP =      500   TIME(PS) =    2401.000  TEMP(K) =   299.36  PRESS =    13.1
>  Etot   =   -213517.0647  EKtot   =     51393.6405  EPtot      =   -264910.7052
>  BOND   =      1270.7129  ANGLE   =      3314.5123  DIHED      =      4131.0446
>  1-4 NB =      1539.9896  1-4 EEL =     19701.1888  VDWAALS    =     33279.2817
>  EELEC  =   -328147.4351  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>  EKCMT  =     23535.4332  VIRIAL  =     23296.0162  VOLUME     =    848190.7099
>                                                     Density    =         1.0129
>  Ewald error estimate:   0.5374E-04
> ------------------------------------------------------------------------------
> 2)
>  NSTEP =      500   TIME(PS) =    2401.000  TEMP(K) =   299.36  PRESS =    13.1
>  Etot   =   -213517.0654  EKtot   =     51393.6395  EPtot      =   -264910.7049
>  BOND   =      1270.7129  ANGLE   =      3314.5123  DIHED      =      4131.0446
>  1-4 NB =      1539.9896  1-4 EEL =     19701.1889  VDWAALS    =     33279.2825
>  EELEC  =   -328147.4356  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>  EKCMT  =     23535.4316  VIRIAL  =     23296.0149  VOLUME     =    848190.7098
>                                                     Density    =         1.0129
>  Ewald error estimate:   0.5377E-04
> ------------------------------------------------------------------------------
> 3)
>  NSTEP =      500   TIME(PS) =    2401.000  TEMP(K) =   299.36  PRESS =    13.1
>  Etot   =   -213517.0647  EKtot   =     51393.6374  EPtot      =   -264910.7020
>  BOND   =      1270.7129  ANGLE   =      3314.5123  DIHED      =      4131.0446
>  1-4 NB =      1539.9896  1-4 EEL =     19701.1889  VDWAALS    =     33279.2823
>  EELEC  =   -328147.4325  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>  EKCMT  =     23535.4327  VIRIAL  =     23295.9982  VOLUME     =    848190.7099
>                                                     Density    =         1.0129
>  Ewald error estimate:   0.5371E-04
>
> I am really confused by this since I never had this problem with sander
> before....
> Someone in my lab observed the same thing (on the same cluster). She just
> showed me one mail that you posted
> (http://amber.scripps.edu/Questions/mail/308.html) that may
> explain it, but to be honest, I am not sure I understand it
> correctly.
> That is why I am writing directly to you for further information.
> Is this behavior "normal"? Or is this problem due to using SGI?
>
> Thanks in advance for your help,
> Fabien
>
> --
> __________________________________________________________________
> Fabien Cailliez                      Tel : 01 58 41 51 63
> Laboratoire de Biochimie Théorique   e-mail : cailliez.ibpc.fr
> IBPC - 13, rue Pierre et Marie Curie - 75005 Paris
> __________________________________________________________________
>
>
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
Received on Thu Feb 17 2005 - 15:53:01 PST