Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ? from Marek Maly on 2013-05-30 (Amber Archive May 2013)

From: Marek Maly <marek.maly.ujep.cz>
Date: Thu, 30 May 2013 16:49:27 +0200

OK,
thanks a lot for the info!

About a year ago I successfully RMAed 2 x GTX 580, but in that case
I could point to the many errors recorded by memtestG80.

So I simply described the problem and attached the memtestG80 outputs, and
that was enough for them. Here (in the Titan case) the situation is not so
clear-cut, but I believe that I can eventually succeed in RMAing my Titans
in this case as well.
But I am not sure how likely it is that the 2 new GPUs will be at least as
good as the ones I return. I don't have time to keep dealing with
repeated RMAs ...

Anyway, my strange (not black-and-white) results from the Amber tests (as
well as the negative memtestG80 tests) suggest that the cards may not be
damaged and that a simple downclock to the stock frequency might solve the
problem. 928 MHz on a Titan might still be OK for gaming, but this
frequency is already "risky" for reliable scientific computation,
especially for long runs (days/weeks) and for some software (here Amber).

The idea that downclocking might help is supported by my hypothesis that
only the best, well-tested chips are selected for factory overclocking,
but maybe I am too idealistic here :))
It's a real pity that NVclock is powerless here and one has to edit the
GPU BIOS directly to downclock the GPU under Linux. But I would keep that
step as the very last resort. I am waiting with some hope for the patch
that Ross announced.

BTW, can you please confirm that your overclocked Titan is also
automatically running at the boost frequency of 928 MHz? You can find this
out using the deviceQuery utility, which is part of the CUDA samples, but
you can also find the actual GPU frequency written in the Amber mdout
files.

Here is my example from one such file:

|------------------- GPU DEVICE INFO --------------------
|
| CUDA Capable Devices Detected: 1
| CUDA Device ID in use: 0
| CUDA Device Name: GeForce GTX TITAN
| CUDA Device Global Mem Size: 6143 MB
| CUDA Device Num Multiprocessors: 14
| CUDA Device Core Freq: 0.93 GHz
|
|--------------------------------------------------------
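
A minimal Python sketch of reading that value out of an mdout file
automatically. The function names are hypothetical, and the sketch assumes
the "CUDA Device Core Freq" line always has the format shown in the block
above:

```python
import re

# Matches the line from the "GPU DEVICE INFO" block of an mdout file, e.g.
# "| CUDA Device Core Freq: 0.93 GHz" (format assumed from the example).
CORE_FREQ_RE = re.compile(r"CUDA Device Core Freq:\s*([0-9]*\.?[0-9]+)\s*GHz")


def core_freq_mhz(mdout_text):
    """Return the reported core frequency in MHz, or None if it is absent."""
    match = CORE_FREQ_RE.search(mdout_text)
    if match is None:
        return None
    return float(match.group(1)) * 1000.0


def read_core_freq_mhz(path):
    """Read an mdout file from disk and return its core frequency in MHz."""
    with open(path) as handle:
        return core_freq_mhz(handle.read())


# Sample text copied from the GPU DEVICE INFO block above.
sample = """\
|------------------- GPU DEVICE INFO --------------------
| CUDA Device Name: GeForce GTX TITAN
| CUDA Device Core Freq: 0.93 GHz
|--------------------------------------------------------
"""

if __name__ == "__main__":
    print(core_freq_mhz(sample))
```

Note that the mdout value is rounded to two decimals, so 0.93 GHz is
consistent with a 928 MHz boost clock but does not pin it down exactly;
deviceQuery remains the more precise check.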

Thanks in advance !

M.

On Thu, 30 May 2013 16:06:17 +0200, ET <sketchfoot.gmail.com> wrote:

> Hi,
>
> I don't think it's particularly lucky. :) The evidence pointed clearly to
> the hardware being faulty IMO. I RMA'd approx. three weeks after purchase,
> so I was out of my 7-day period (UK) where I can return it if I don't like
> the color. Where did you get your card from? Is it harder to get an RMA in
> the country where you are based? I have heard (don't know how true it is)
> that it is harder to do this in the States?
>
> I don't imagine they did anything more than run the Heaven and Valley
> benchmarks. If it was a manufacturer-supplied test, then the manufacturer
> would have caught it before it was sent out for sale, and I can't imagine
> a store developing their own test, though I may be wrong on that.
>
> I will post my benchmark results asap, though it may be tomorrow.
>
> I hope you get your card sorted out too! :)
>
> FYI: My RMA request as follows:
>
>
> #######################################
> My Setup is as follows: i7-930 intel Quad core CPU, 6GB RAM on a Gigabyte
> GA-C58-UD7 motherboard. I have two NVIDIA GPUs installed: 1x EVGA
> superclocked Geforce Titan and the other (one that I wish to return)
> which is a standard (not overclocked) EVGA Geforce Titan. I'm not running
> an SLI setup and use the GPUs for running Bio-physical simulations. The
> system runs headless without any GUI and thus no display. This makes it a
> pure compute card and thus any errors are related to this rather than
> display misconfigurations.
>
> I have had the superclocked geforce for a longer time and have been
> benchmarking it against a standard test simulation without any issues. On
> receiving the standard geforce, I realised that it was crashing
> catastrophically (after 10-15 mins) whilst running the same benchmark
> that the other card did not have a problem with.
>
> I verified that this card was faulty by swapping the cards around so they
> occupied their partner's PCI-e slot (so still in a dual GPU
> configuration). The problem persisted. So I took the superclocked card
> out and tested the card on its own in first one PCI-e slot, then the
> other. As the problem has not gone away and the other card tested did not
> have a problem with the benchmark, my conclusion is that the standard
> Geforce is faulty.
>
> I would like to return the card for a replacement. If it is at all
> possible, could I get another superclocked EVGA? I am happy to pay the
> price difference.
> ###########################################
>
> br,
> g
>
>
> On 30 May 2013 13:32, Marek Maly <marek.maly.ujep.cz> wrote:
>
>> Lucky guy ! :))
>>
>> I am just curious what your original justification was
>> for the RMA of that Titan. How did you argue it? Just
>> with the Amber calculation instability, or did you also find
>> errors in other common tests like memtestG80,
>> cuda_memtest, gpu_burn and/or some common Windows performance testers
>> (Heaven, 3DMark ...)?
>>
>> It would be nice to know the name of the test that the returns
>> technicians used and that clearly and undoubtedly proved that the given
>> GPU is defective.
>>
>> How long after purchase did you RMA this card?
>>
>> I am also curious about your Amber reproducibility benchmark tests. I am
>> now running 500k-step ones with the updated driver 319.23, and for the
>> moment it does not seem that the driver update has solved the problems :((
>>
>> Marek
>>
>>
>>
>>
>> On Thu, 30 May 2013 14:08:18 +0200, ET <sketchfoot.gmail.com> wrote:
>>
>> > An update:
>> >
>> > Just got a mail from ebuyer who said:
>> >
>> > Following extensive tests by our returns technicians, this item was
>> > found to be faulty. A replacement product will be dispatched as soon as
>> > the RMA is closed.
>> >
>> > For more details check the My Orders section of www.ebuyer.com
>> >
>> > Kind regards,
>> >
>> > Ebuyer Customer Support
>> >
>> >
>> >
>> > On 30 May 2013 09:33, ET <sketchfoot.gmail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> I believe this was the specific driver I used:
>> >>
>> >> http://www.nvidia.com/object/linux-display-amd64-313.30-driver.html
>> >>
>> >>
>> >> I'm running the benchmark now on the super-duper-clocked geforce that
>> >> I believe is "working". I can't do it on the other Titan as I've RMA'd
>> >> it. Dunno how long it will take as my CPU is only a quad core i7 :(
>> >>
>> >> Will post my results back when done.
>> >>
>> >> br,
>> >> g
>> >>
>> >>
>> >>
>> >>
>> >> On 30 May 2013 03:42, Jason Swails <jason.swails.gmail.com> wrote:
>> >>
>> >>> On Wed, May 29, 2013 at 6:00 PM, Marek Maly <marek.maly.ujep.cz> wrote:
>> >>>
>> >>> > Hi Jason,
>> >>> >
>> >>> > thanks for the explanation but to be frank I did not understand the
>> >>> > main idea.
>> >>> >
>> >>>
>> >>> I'll try to explain a little bit (but perhaps it's better to just
>> >>> take Scott's advice and trust him on that). The problem is that
>> >>> addition of floating point numbers in computers is not strictly
>> >>> associative. That is, a + (b + c) != (a + b) + c, due to round-off
>> >>> issues in the last decimal place or so. As a result, the numerical
>> >>> result of a summation on a computer depends on the _order_ in which
>> >>> those numbers are added. If you change the 'order of operations,'
>> >>> then you risk changing the exact value of the result in the last
>> >>> stored decimal. See the wikipedia page on Floating point accuracy:
>> >>> https://en.wikipedia.org/wiki/Floating_point#Accuracy_problems
>> >>>
>> >>> Since the force calculation and energy calculation follow different
>> >>> code paths, the 'order of operations' differs between the two
>> >>> routines. As a result, the exact forces may vary a tinytinytiny bit
>> >>> depending on whether the force or energy routine was called. This
>> >>> difference is tiny and negligible, but since classical systems of >2
>> >>> bodies are chaotic, these differences eventually manifest as
>> >>> completely different trajectories.
>> >>>
>> >>> As Scott said, this difference is expected, unavoidable, and
>> >>> conveniently unimportant. (In fact, some may argue it's a _good
>> >>> thing_).
>> >>>
>> >>>
>> >>> >
>> >>> > I understand that evolving a system by Molecular Dynamics requires
>> >>> > only the forces, not the energy, so the energy is calculated only
>> >>> > when explicitly requested (i.e. every NTPR steps). What I have
>> >>> > trouble understanding is why the instantaneous energy value E(i)
>> >>> > printed in the mdout file at step "i" should depend on the number
>> >>> > of my "energy requests" before the simulation reached step "i"
>> >>> > (i.e. on the NTPR value). I naturally assume that my energy
>> >>> > requests do not influence the evolution of my molecular system
>> >>> > (e.g. do not influence the forces ...). I see the NTPR parameter
>> >>> > just as the period at which some function "CALCULATE_ENERGIES" is
>> >>> > called to calculate all the energy components of the simulated
>> >>> > system at the given moment, that's all, but perhaps I am not right
>> >>> > here?
>> >>> >
>> >>> > How exactly is the "ene_avg_sampling" parameter connected with the
>> >>> > "NTPR" parameter?
>> >>> >
>> >>>
>> >>> Like the "ntpr" parameter, the ene_avg_sampling variable tells pmemd
>> >>> how frequently you _want_ it to calculate energies. If
>> >>> ene_avg_sampling is set to 10, then pmemd.cuda will compute energies
>> >>> every 10 steps so they can be averaged. If ntpr is any multiple of 10,
>> >>> then pmemd.cuda will still compute energies _only_ every 10 steps (so
>> >>> that it can be averaged that often). As a result, the code path is
>> >>> dictated by the fact that ene_avg_sampling is 10 rather than by the
>> >>> value of ntpr.
>> >>>
>> >>> I hope this clarified things a little bit...
>> >>>
>> >>> Jason
>> >>>
>> >>> --
>> >>> Jason M. Swails
>> >>> Quantum Theory Project,
>> >>> University of Florida
>> >>> Ph.D. Candidate
>> >>> 352-392-4032
>> >>> _______________________________________________
>> >>> AMBER mailing list
>> >>> AMBER.ambermd.org
>> >>> http://lists.ambermd.org/mailman/listinfo/amber
>> >>>
>> >>
>> >>
>>
>>
>>

-- 
This message was created with Opera's revolutionary e-mail client:
http://www.opera.com/mail/
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber

Received on Thu May 30 2013 - 08:30:02 PDT