Re: AMBER: Implicit precision in sander vs architecture from Robert Duke on 2003-10-20 (Amber Archive Oct 2003)

From: Robert Duke <rduke.email.unc.edu>
Date: Mon, 20 Oct 2003 21:06:22 -0400

Folks -
I would not want to be an alarmist, but I think Surjit has a point. About
two machines back, when machines were going over the 128MB line somewhere in
the mid 90's, I bought a nice Dell with ECC memory because there had been
some concern about such issues when I worked at Microsoft on NT (yes, I did
NT dev work - networking and file systems - all the unix maniacs please put
the rotten eggs down). After that machine, I would have kept buying ECC
memory, but various other types of RAM became more popular, and I did not
see much discussion about things like ECC. I rationalized that any
cosmic-ray flipped bits would probably bring down the OS in fairly short
order anyway, making the problem apparent (I was writing code, not cranking
out data), and forgot about it. Trouble is, when we do MD calcs, we set up
really large targets for bit flipping in the form of all the really big data
arrays. If the chips don't detect or correct errors, bad things could
happen without bringing down the OS. I became more of a believer in this
when I put some more memory in one of my cheap linux office machines, and
suddenly had spurious weird data problems, but no problem with the OS. I at
first thought it was heating (a failing north bridge fan), but eventually
localized the problem when I realized it was only happening in the higher
reaches of memory and I did some chip swapping. Point is, cheap machines
could produce headaches, and things like non-error correcting memory could
produce real low frequency events that ever so slightly muck up the
dynamics. One of the really nasty aspects of MD is that there really is no
way to be certain of your results - you can't run regressions and know with
absolute certainty that everything is okay. Boy, do I still like integers
(no roundoff error). Currently, blades are a little suspect due to
overheating problems (sorry, didn't bookmark the reference). I have seen
weird network events on linux interconnects produce wrong data (run dies 50
steps in; identical run restarted goes for hours - we have been having
interconnect problems lately so this one is on my mind). My feeling lately
on all this has been that you get what you pay for.
Regards - Bob
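
To make the bit-flipping concern above concrete, here is a minimal sketch in Python (not part of sander or any AMBER code). It inverts a single bit in the IEEE 754 double encoding of a number and shows how much the value changes. The function name flip_bit and the sample coordinate are hypothetical and chosen only for illustration. A flip in a low mantissa bit is a tiny perturbation that a run would silently absorb, while a flip in an exponent bit changes the value by orders of magnitude.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 64-bit IEEE 754 encoding inverted.

    Bit 0 is the least significant mantissa bit, bits 52-62 are the
    exponent, and bit 63 is the sign.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the double as an unsigned 64-bit integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # Invert the chosen bit and reinterpret the result as a double.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


if __name__ == "__main__":
    coordinate = 12.345  # hypothetical atomic coordinate, illustration only
    for bit in (0, 30, 52, 62):
        print(f"bit {bit:2d}: {coordinate!r} -> {flip_bit(coordinate, bit)!r}")
```

Neither kind of flip raises an error or touches the OS, which is the point: without ECC, nothing in the program notices that an array element changed.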

----- Original Message -----
From: "Amber Administration" <amber-admin.scripps.edu>
To: <amber.scripps.edu>
Sent: Monday, October 20, 2003 7:18 PM
Subject: Re: AMBER: Implicit precision in sander vs architecture

> Hello,
>
> This message triggered our majordomo filtering.
>
> ---------- Forwarded message ----------
> Date: Fri, 17 Oct 2003 09:17:08 -0700 (PDT)
> From: Surjit Dixit <sdixit.wesleyan.edu>
> Reply-To: sdixit.wesleyan.edu
> Organization: Wesleyan University
> To: amber.scripps.edu
> Subject: Re: AMBER: Implicit precision in sander vs architecture
> References: <1066405361.3638.52.camel.pcumr70.biomedicale.univ-paris5.fr>
> In-Reply-To: <1066405361.3638.52.camel.pcumr70.biomedicale.univ-paris5.fr>
>
> Have you looked into the type of RAM on these 32 bit machines (PC's)? I
> have noticed differences between simulations being run on PC's with ECC
> (Error Checking and Correcting) and non-ECC RAM.
> The trajectories run on machines with non-ECC RAM tend to be unstable and
> prone to such problems.
> Surjit
>
>
> Teletchéa Stéphane wrote:
> > I have recently seen in my dynamics simulations 'vlimit exceeded' on my
> > dual xeon although when using the same restart the calculation runs fine
> > on a power4 IBM.
> >
> > I know that implicit double precision is used in fortran code, but i
> > have no idea of what it means when the calculations are made on 32-bits
> > systems (pentium/athlon) or on 64-bits systems (power4, mips, ...).
> >
> > I'm a little bit afraid since i have the impression that these errors
> > only appear on 32-bit machines. Are roundoff errors more significant there?
> >
> > Can someone detail the implicit precision each architecture involves for
> > floating point operations ?
> >
> > Are 32-bits machines 2 times less precise than 64-bits machines ?
> >
> > I know that after a certain point (as pointed out by Dr. R. Duke) dynamics
> > do diverge 'naturally' because of roundoff, but does it happen more
> > rapidly on 32-bit systems, as I suspect, because of lower precision?
> >
> > Any hint would be very helpful.
> >
> > Stéphane Teletchéa
> >
> > Writing his PhD ...
> > --
>
>
> -----------------------------------------------------------------------
> The AMBER Mail Reflector
> To post, send mail to amber.scripps.edu
> To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
>
>

Received on Tue Oct 21 2003 - 02:53:01 PDT