- Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]

From: Robert Duke <rduke.email.unc.edu>

Date: Mon, 4 Jul 2005 16:30:54 -0400

Petr -

Thanks for the problem report. I have not yet thoroughly looked at

everything yet, but it looks to me like the whole problem hinges on whether

a fractional coordinate calculation can produce a result outside the range

of 0.0 to 0.999... (ie., if this assertion is correct, bad things will

follow elsewhere; no need to even check). Now, the basic fractional

coordinate calculation is of the form:

fract = f1 - dnint(f1) + 0.5d0

Have you actually trapped a value outside the range, or is this a postulate

to explain results? According to the definition of dnint() (f95 reference

standard) dnint(a) has the value of dint(a + 0.5) (actually the ref just

defines anint/aint, but dnint/dint are just the specific double precision

intrinsics), so for instance dnint(1.5) ALWAYS should round up 2.0; given

that 0.5 is precisely representable internally in binary, there should be no

rounding errors that can produce a 1.0 result from this calculation (for

negative values it is the same story, but dnint(a) has the value dint(a -

0.5). I have not thought about this in a while, but remember looking it

over fairly closely when the code was first done; the basic calculation and

assumptions also exist in sander; I think this stuff all derives from Tom

Darden's original pbc code, but maybe I am wrong about that. I'll do some

checks with ifort and your samples in the near future, but unless a

fractional is producing 1.0, this is not the problem. If dnint() does

produce such a result, then it is broken; there have been broken intel

library functions before, but one would hope that the math intrinsics are

right.

Regards - Bob Duke

----- Original Message -----

From: "Petr Kulhanek" <kulhanek.chemi.muni.cz>

To: <amber.scripps.edu>

Sent: Monday, July 04, 2005 12:33 PM

Subject: AMBER: pmemd - bug report

*> Dear AMBER developers,
*

*>
*

*> I would like to report a bug in PMEMD code (AMBER 8.0).
*

*>
*

*> The bug is responsible that some atoms are incorrectly addressed to
*

*> respective subcells. The bug occures only if some atoms are very close to
*

*> box walls (confirmed) or if they are on subcell interfaces (just
*

*> hypothesis). As the result, program can crash or the non-bonded energy is
*

*> not calculated correctly. The bug probably depends on architecture and/or
*

*> used compiler.
*

*>
*

*> Test cases (for IA32, ifort 8.1.024, and 1CPU) can be downloaded from:
*

*>
*

*> segmentation fault
*

*> http://www.ncbr.chemi.muni.cz/~kulhanek/downloads/tst1.tar.gz
*

*>
*

*> infinity Eel
*

*> http://www.ncbr.chemi.muni.cz/~kulhanek/downloads/tst2.tar.gz
*

*>
*

*> incorrect Eel
*

*> http://www.ncbr.chemi.muni.cz/~kulhanek/downloads/tst3.tar.gz
*

*>
*

*> In each archive, six results are present:
*

*>
*

*> mdout_bug_reference - calculated by pmemd on coordinates that leads to
*

*> incorrect results
*

*> mdout_bug_imaged_reference - calculated by pmemd on coordinates, which
*

*> were processed by ptraj (solut is centered and system is imaged)
*

*> mdout_fix_reference - calculated by modified pmemd (see below) on
*

*> coordinates that leads to incorrect results with non-modified pmemd
*

*> mdout_fix_imaged_reference - calculated by modified pmemd (see below) on
*

*> coordinates, which were processed by ptraj (solut is centered and system
*

*> is imaged)
*

*> mdout_sander_reference - calculated by sander on coordinates that leads to
*

*> incorrect results with non-modified pmemd
*

*> mdout_sander_imaged_reference - calculated by sander on coordinates, which
*

*> were processed by ptraj (solut is centered and system is imaged)
*

*>
*

*>
*

*> The bug is located in following files:
*

*>
*

*> ew_direct_cit.f90 -> subroutine get_fract_crds (confirmed)
*

*> fraction could be equal to 1.0 due to round errors, however the code
*

*> suppose that it has to be less than 1.0
*

*>
*

*> ew_direct_cit.f90 -> subroutine setup_crd_idx_lst_tbl (confirmed)
*

*> indexes to crd_idx_lst_tbl array are not checked if they are in legal
*

*> ranges
*

*> if fractions calculated in get_fract_crds are equal to 1.0 then indexes
*

*> overflow crd_idx_lst_tbl boundaries
*

*>
*

*> ew_pairlist.f90 -> subroutine get_nb_list (confirmed) and probably
*

*> get_nb_list_nonorthog (not tested)
*

*> indexes to x_bkts, y_bkts, and especially z_bkts arrays are not checked if
*

*> they are in legal ranges
*

*> array overflows could occur even if fractions are in legal ranges because
*

*> the indexes are calculated directly from imaged coordinates (different
*

*> approach -> different round errors)
*

*>
*

*> ew_recip_cit.f90 -> subroutine claim_recip_imgs (hypothesis) and probably
*

*> claim_recip_imgs_nonorthog (not tested)
*

*> z_bkt_lo and z_bkt_hi, the same as above
*

*>
*

*>
*

*> I prepared two patches for testing purposes. (reference source is amber8
*

*> distro + bugfixes 1-44)
*

*> http://www.ncbr.chemi.muni.cz/~kulhanek/downloads/check.patch
*

*> http://www.ncbr.chemi.muni.cz/~kulhanek/downloads/fix.patch
*

*>
*

*> The first one extends code with boundary check in above mentioned
*

*> subroutines. If some overflow occures then the program is immediatelly
*

*> stop with error message. The second one fixes the problem (boundary check
*

*> is still present).
*

*> fix.patch is only intended for testing purposes because it does not solve
*

*> possible problems in simulations using truncated octahedral box and it was
*

*> not tested under parallel execution.
*

*>
*

*> Some comments at the end:
*

*>
*

*> I found this bug when I did postprocessing analysis of trajectory
*

*> calculated by pmemd (MD was run without any problems as I remember). In
*

*> the analysis, the electrostatic energy is calculated for several system
*

*> subparts still under PBC. In the reported cases, only solut and solvent
*

*> are present (ions were stripped away), so the test cases represent rare
*

*> systems. However observed errors depend only on coordinates and I do not
*

*> see any reason why it could not occur during normal MD !
*

*> I think that it is neccessary to carefully check the pmemd code and remove
*

*> possible problems with reported round errors when subcell indexes are
*

*> calculated. Secondly, it is also important to keep the parts, in which
*

*> subcell indexes are calculated, as same as possible to avoid different
*

*> round errors.
*

*>
*

*> For SANDER developers:
*

*>
*

*> Please look to tst1.tar.gz archive. mdout_sander_reference and
*

*> mdout_sander_imaged_reference differ. The difference is about 3 kcal/mol
*

*> in Eel as well as in Evdw. I think that this could be a bug in sander.
*

*>
*

*> Best regards,
*

*> Petr
*

*>
*

*> --
*

*> ##################################################
*

*> Petr Kulhanek
*

*> ------------------------------------------------
*

*> kulhanek.chemi.muni.cz
*

*> ------------------------------------------------
*

*> National Centre for Biomolecular Research
*

*> Masaryk University Brno
*

*> Kotlarska 2, CZ-611 37 Brno
*

*> Czech Republic
*

*> ##################################################
*

*> -----------------------------------------------------------------------
*

*> The AMBER Mail Reflector
*

*> To post, send mail to amber.scripps.edu
*

*> To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
*

*>
*

-----------------------------------------------------------------------

The AMBER Mail Reflector

To post, send mail to amber.scripps.edu

To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu

Received on Mon Jul 04 2005 - 21:53:00 PDT

Date: Mon, 4 Jul 2005 16:30:54 -0400

Petr -

Thanks for the problem report. I have not yet thoroughly looked at

everything yet, but it looks to me like the whole problem hinges on whether

a fractional coordinate calculation can produce a result outside the range

of 0.0 to 0.999... (ie., if this assertion is correct, bad things will

follow elsewhere; no need to even check). Now, the basic fractional

coordinate calculation is of the form:

fract = f1 - dnint(f1) + 0.5d0

Have you actually trapped a value outside the range, or is this a postulate

to explain results? According to the definition of dnint() (f95 reference

standard) dnint(a) has the value of dint(a + 0.5) (actually the ref just

defines anint/aint, but dnint/dint are just the specific double precision

intrinsics), so for instance dnint(1.5) ALWAYS should round up 2.0; given

that 0.5 is precisely representable internally in binary, there should be no

rounding errors that can produce a 1.0 result from this calculation (for

negative values it is the same story, but dnint(a) has the value dint(a -

0.5). I have not thought about this in a while, but remember looking it

over fairly closely when the code was first done; the basic calculation and

assumptions also exist in sander; I think this stuff all derives from Tom

Darden's original pbc code, but maybe I am wrong about that. I'll do some

checks with ifort and your samples in the near future, but unless a

fractional is producing 1.0, this is not the problem. If dnint() does

produce such a result, then it is broken; there have been broken intel

library functions before, but one would hope that the math intrinsics are

right.

Regards - Bob Duke

----- Original Message -----

From: "Petr Kulhanek" <kulhanek.chemi.muni.cz>

To: <amber.scripps.edu>

Sent: Monday, July 04, 2005 12:33 PM

Subject: AMBER: pmemd - bug report

-----------------------------------------------------------------------

The AMBER Mail Reflector

To post, send mail to amber.scripps.edu

To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu

Received on Mon Jul 04 2005 - 21:53:00 PDT

Custom Search