Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ? from Scott Le Grand on 2013-06-01 (Amber Archive Jun 2013)

From: Scott Le Grand <varelse2005.gmail.com>
Date: Sat, 1 Jun 2013 12:20:49 -0700

PS 100K steps is an empirical guess based on previous hunts for defective
GPUs in the days of C2050 and GTX 480...

Seems to be doing the trick here too...

On Sat, Jun 1, 2013 at 12:16 PM, Scott Le Grand <varelse2005.gmail.com> wrote:

> All force accumulation is done with 64-bit fixed point integers so their
> summation is utterly order-independent - all roundoff happens in a
> deterministic manner at the point of type conversion from single-precision
> to 64-bit int. Therefore, each simulation with the same starting
> conditions on the same hardware will follow the exact same chaotic
> trajectory - like watching the same movie over and over again - no two
> movies are alike, but watching the same movie twice better be.
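
A minimal sketch of the idea in the paragraph above, not the actual AMBER GPU code: each single-precision term is rounded once when it is converted to an integer, and integer addition is associative, so the order of summation cannot change the total. The scale factor `SCALE` and the helper names are assumptions made for illustration, and Python integers are unbounded, so 64-bit overflow is not modeled here.

```python
import random
import struct

# Hypothetical fixed-point scale chosen for this sketch; not AMBER's value.
SCALE = 1 << 40


def to_f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def to_fixed(x):
    """Convert one single-precision term to a fixed-point integer.

    All rounding happens here, once per term, independent of ordering.
    """
    return int(round(to_f32(x) * SCALE))


def sum_fixed(terms):
    """Accumulate in integers; the result does not depend on term order."""
    return sum(to_fixed(t) for t in terms)


def sum_float(terms):
    """Accumulate in floating point; the result may depend on term order."""
    total = 0.0
    for t in terms:
        total += t
    return total


rng = random.Random(0)
forces = [rng.uniform(-1e3, 1e3) for _ in range(10000)]
shuffled = forces[:]
rng.shuffle(shuffled)

fixed_a = sum_fixed(forces)
fixed_b = sum_fixed(shuffled)
float_a = sum_float(forces)
float_b = sum_float(shuffled)

print("fixed-point sums identical:", fixed_a == fixed_b)
print("floating-point sums identical:", float_a == float_b)
```

The fixed-point totals match for any permutation of the terms. The floating-point totals are not guaranteed to match, because floating-point addition is not associative.
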
>
> If it's not, it's some sort of bug... I did this both because I believe
> that reproducibility of experimental results is really important and
> because it's handy for finding SW and HW bugs: any nondeterministic
> divergence between trajectories is a 100% indicator that something went
> wrong.
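
One way such a check could look, as a sketch under stated assumptions: two runs from identical starting conditions are compared value by value, and the first step at which they differ is reported. The function name `first_divergence` and the plain lists of per-step values are hypothetical, and no AMBER output format is parsed here. The comparison is deliberately exact, with no tolerance, since any difference at all is the signal.

```python
def first_divergence(run_a, run_b):
    """Return the index of the first step where two runs differ, else None."""
    for step, (a, b) in enumerate(zip(run_a, run_b)):
        if a != b:  # exact comparison on purpose: no tolerance
            return step
    if len(run_a) != len(run_b):
        # One run is a strict prefix of the other.
        return min(len(run_a), len(run_b))
    return None


# Made-up per-step values, purely for illustration.
reference = [-1234.5678, -1234.6012, -1235.0021, -1234.9987]
repeat_ok = list(reference)
repeat_bad = [-1234.5678, -1234.6012, -1235.0022, -1234.9991]

print(first_divergence(reference, repeat_ok))
print(first_divergence(reference, repeat_bad))
```

For program versions whose printed energies can differ in the last digit while the trajectory stays identical, comparing coordinates rather than energies would avoid false alarms.
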
>
> The only caveat is that in old versions of the program, energy summation
> was done partially with doubles in an unpredictable order. This caused
> transient differences in the last sig-fig of the energy, but the
> trajectory was still identical. This should be fixed in current code.
>
> The only thing that's funny about these tests is how little they diverge.
> So I am *hoping* this might be a bug in cuFFT rather than a GTX Titan HW
> issue. That would explain why GB simulations, which do not use FFTs, are
> deterministic and PME simulations aren't.
>
> On Sat, Jun 1, 2013 at 12:04 PM, Jan-Philip Gehrcke <
> jgehrcke.googlemail.com> wrote:
>
>> On 06/01/2013 08:48 PM, Scott Le Grand wrote:
>> > "Also am I right in thinking (from what Scott was saying) that all the
>> > benchmarks should be reproducible across 50k steps but begin to diverge
>> > at around 100K steps? Is there any difference between setting *ig* to an
>> > explicit number and removing it from the mdin file?"
>> >
>> > They should *never* diverge when running the same code on the same GPU
>> > configuration on the same machine unless they use a different random
>> > seed...
>> >
>>
>> No divergence after N time steps even for large N? How can that be
>> possible for a chaotic system? Are round-off errors deterministic then?
>> And if so, where does the crucial limit of 100k steps, which you mention
>> throughout this mailing list thread, come from?
>>
>> Thanks for clarifying,
>>
>> Jan-Philip
>>
>>
>>
>>
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber
>>
>
>
Received on Sat Jun 01 2013 - 12:30:03 PDT