Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Scott Le Grand <varelse2005.gmail.com>
Date: Thu, 30 May 2013 09:55:44 -0700

I am very interested in the results of your downclocking. Let us all
know...
On May 30, 2013 8:09 AM, "Marek Maly" <marek.maly.ujep.cz> wrote:

> OK,
> thanks a lot for the info!
>
> About a year ago I successfully RMAed 2 x GTX 580, but in that case
> I could argue from the records of many errors obtained with memtestG80.
>
> So I simply described the problem and attached the memtestG80 outputs, and
> that was enough for them. Here (in the Titan case) the situation is not so
> clear-cut, but I believe that I can eventually succeed in RMAing my
> Titans in this case too.
> But I am not sure how likely it is that the 2 new GPUs will be at least as
> good as those I RMAed. I don't have time to deal with RMA issues
> repeatedly ...
>
> Anyway, my strange (not black-and-white) results from the Amber tests (as
> well as the negative memtestG80 tests) suggest that the cards may not be
> damaged and that a simple downclock to stock frequency might solve the
> problem. 928 MHz on a Titan might still be OK for gaming, but this
> frequency is already "risky" for reliable scientific computation,
> especially for long runs (days/weeks) and for some software
> (here Amber).
>
> The idea that downclocking might help is supported by my hypothesis that
> only the best, well-tested chips are selected for factory overclocking,
> but maybe I am too idealistic here :))
> It's just a real pity that NVclock is "impotent" here and one has to edit
> the GPU BIOS directly to downclock the GPU under Linux. But I would keep
> that step as the very last resort. I am waiting with some hope for the
> patch that Ross announced.
>
> BTW, can you please confirm that your overclocked Titan is also
> automatically running at the boost frequency of 928 MHz? You can find
> this out with the deviceQuery program, which is part of the CUDA samples,
> but the actual GPU frequency is also written in the Amber mdout files.
>
> Here is my example from one such file:
>
> |------------------- GPU DEVICE INFO --------------------
> |
> | CUDA Capable Devices Detected: 1
> | CUDA Device ID in use: 0
> | CUDA Device Name: GeForce GTX TITAN
> | CUDA Device Global Mem Size: 6143 MB
> | CUDA Device Num Multiprocessors: 14
> | CUDA Device Core Freq: 0.93 GHz
> |
> |--------------------------------------------------------
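
A minimal sketch of reading the reported clock out of an mdout file, assuming the GPU DEVICE INFO block always has the "| Key: value" layout shown in the excerpt above. The names parse_gpu_info and core_freq_mhz are made up for this illustration and are not part of Amber. The excerpt prints the frequency in GHz with two decimals, so a 928 MHz clock would appear as 0.93 GHz.

```python
import re

# Sample text copied from the mdout excerpt quoted above.
SAMPLE_MDOUT = """\
|------------------- GPU DEVICE INFO --------------------
|
| CUDA Capable Devices Detected: 1
| CUDA Device ID in use: 0
| CUDA Device Name: GeForce GTX TITAN
| CUDA Device Global Mem Size: 6143 MB
| CUDA Device Num Multiprocessors: 14
| CUDA Device Core Freq: 0.93 GHz
|
|--------------------------------------------------------
"""

# Matches lines such as "| CUDA Device Core Freq: 0.93 GHz".
_INFO_LINE = re.compile(r"^\s*\|\s*(CUDA[^:]*):\s*(.*?)\s*$")


def parse_gpu_info(mdout_text):
    """Collect the 'CUDA ...: value' pairs of the GPU DEVICE INFO block."""
    info = {}
    for line in mdout_text.splitlines():
        match = _INFO_LINE.match(line)
        if match:
            info[match.group(1).strip()] = match.group(2)
    return info


def core_freq_mhz(info):
    """Convert the 'CUDA Device Core Freq' entry (e.g. '0.93 GHz') to MHz."""
    value, unit = info["CUDA Device Core Freq"].split()
    factor = {"GHz": 1000.0, "MHz": 1.0}[unit]  # assumed possible units
    return float(value) * factor


if __name__ == "__main__":
    gpu = parse_gpu_info(SAMPLE_MDOUT)
    print(gpu["CUDA Device Name"], round(core_freq_mhz(gpu)), "MHz")
```

Because of the two-decimal rounding, this cannot distinguish 928 MHz from, say, 932 MHz; deviceQuery remains the better source for the exact clock.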
>
>
> Thanks in advance !
>
> M.
>
>
>
> On Thu, 30 May 2013 16:06:17 +0200, ET <sketchfoot.gmail.com> wrote:
>
> > Hi,
> >
> > I don't think it's particularly lucky. :) The evidence pointed clearly to
> > the hardware being faulty IMO. I RMA'd approx. three weeks after
> > purchase, so I was out of my 7 day period (UK) where I can return if I
> > don't like the color. Where did you get your card from? Is it harder to
> > get an RMA in the country where you are based? I have heard (don't know
> > how true it is) that it is harder to do this in the States.
> >
> > I don't imagine they did anything more than run the Heaven and Valley
> > benchmarks. If it was a manufacturer-supplied test, then the manufacturer
> > would have caught it before it was sent out for sale, and I can't imagine
> > a store developing their own test, though I may be wrong on that.
> >
> > I will post my benchmark results asap, though it may be tomorrow.
> >
> > I hope you get your card sorted out too! :)
> >
> > FYI: My RMA request as follows:
> >
> >
> > #######################################
> > My setup is as follows: Intel i7-930 quad-core CPU, 6GB RAM on a Gigabyte
> > GA-C58-UD7 motherboard. I have two NVIDIA GPUs installed: 1x EVGA
> > superclocked GeForce Titan and the other (the one that I wish to return),
> > which is a standard (not overclocked) EVGA GeForce Titan. I'm not running
> > an SLI setup and use the GPUs for running biophysical simulations. The
> > system runs headless without any GUI and thus no display. This makes it a
> > pure compute card, and thus any errors are related to this rather than to
> > display misconfigurations.
> >
> > I have had the superclocked GeForce for a longer time and have been
> > benchmarking it against a standard test simulation without any issues. On
> > receiving the standard GeForce, I realised that it was crashing
> > catastrophically (after 10-15 mins) whilst running the same benchmark
> > that the other card did not have a problem with.
> >
> > I verified that this card was faulty by swapping the cards around so they
> > occupied their partner's PCI-e slot (so still in a dual GPU
> > configuration). The problem persisted. So I took the superclocked card
> > out and tested the card on its own in first one PCI-e slot, then the
> > other. As the problem has not gone away and the other card tested did not
> > have a problem with the benchmark, my conclusion is that the standard
> > GeForce is faulty.
> >
> > I would like to return the card for a replacement. If it is at all
> > possible, could I get another superclocked EVGA? I am happy to pay the
> > price difference.
> > ###########################################
> >
> > br,
> > g
> >
> >
> > On 30 May 2013 13:32, Marek Maly <marek.maly.ujep.cz> wrote:
> >
> >> Lucky guy ! :))
> >>
> >> I am just curious what your original justification was
> >> for the RMA of that Titan. How did you argue? Just with the
> >> instability of Amber calculations, or did you also find
> >> errors in other common tests such as memtestG80,
> >> cuda_memtest, gpu_burn and/or common Windows performance testers
> >> (Heaven, 3DMark ...)?
> >>
> >> It would be nice to know the name of the test that the returns
> >> technicians used and that clearly and undoubtedly proved that the GPU is
> >> defective.
> >>
> >> How long after purchase did you RMA this card?
> >>
> >> I am also curious about your Amber benchmark reproducibility tests. I am
> >> now running 500k-step ones with the updated driver 319.23, and for the
> >> moment it does not seem that the driver update solved the problems :((
> >>
> >> Marek
> >>
> >>
> >>
> >>
> >> On Thu, 30 May 2013 14:08:18 +0200, ET <sketchfoot.gmail.com> wrote:
> >>
> >> > An update:
> >> >
> >> > Just got a mail from ebuyer who said:
> >> >
> >> > Following extensive tests by our returns technicians, this item was
> >> > found to be faulty. A replacement product will be dispatched as soon as
> >> > the RMA is closed.
> >> >
> >> > For more details check the My Orders section of www.ebuyer.com
> >> >
> >> > Kind regards,
> >> >
> >> > Ebuyer Customer Support
> >> >
> >> >
> >> >
> >> > On 30 May 2013 09:33, ET <sketchfoot.gmail.com> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> I believe this was the specific driver I used:
> >> >>
> >> >> http://www.nvidia.com/object/linux-display-amd64-313.30-driver.html
> >> >>
> >> >>
> >> >> I'm running the benchmark now on the super-duper-clocked geforce that I
> >> >> believe is "working". I can't do it on the other Titan as I've RMA'd it.
> >> >> Dunno how long it will take as my CPU is only a quad core i7 :(
> >> >>
> >> >> Will post my results back when done.
> >> >>
> >> >> br,
> >> >> g
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On 30 May 2013 03:42, Jason Swails <jason.swails.gmail.com> wrote:
> >> >>
> >> >>> On Wed, May 29, 2013 at 6:00 PM, Marek Maly <marek.maly.ujep.cz> wrote:
> >> >>>
> >> >>> > Hi Jason,
> >> >>> >
> >> >>> > thanks for the explanation but to be frank I did not understand the
> >> >>> > main idea.
> >> >>> >
> >> >>>
> >> >>> I'll try to explain a little bit (but perhaps it's better to just take
> >> >>> Scott's advice and trust him on that). The problem is that addition of
> >> >>> floating point numbers in computers is not strictly associative. That
> >> >>> is, a + (b + c) != (a + b) + c, due to round-off issues in the last
> >> >>> decimal place or so. As a result, the numerical result of a summation
> >> >>> on a computer depends on the _order_ in which those numbers are added.
> >> >>> If you change the 'order of operations,' then you risk changing the
> >> >>> exact value of the result in the last stored decimal. See the Wikipedia
> >> >>> page on floating point accuracy:
> >> >>> https://en.wikipedia.org/wiki/Floating_point#Accuracy_problems
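
The non-associativity described above can be reproduced with a few lines of Python, which uses IEEE 754 doubles for its float type. This is only an illustrative sketch; the helper sum_in_order is a made-up name, and the numbers are chosen to make the effect visible rather than taken from any Amber run.

```python
import math


def sum_in_order(values):
    """Plain left-to-right floating point summation."""
    total = 0.0
    for v in values:
        total += v
    return total


# Same three numbers, two groupings.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

# Same three numbers, two summation orders.
values = [1e16, 1.0, -1e16]
reordered = [1e16, -1e16, 1.0]

if __name__ == "__main__":
    print(left == right)            # False: the two groupings round differently
    print(sum_in_order(values))     # 0.0: the 1.0 is lost next to 1e16
    print(sum_in_order(reordered))  # 1.0: the large terms cancel first
    print(math.fsum(values))        # 1.0: exactly rounded, order-independent
```

A parallel reduction on a GPU adds terms in whatever order the threads are organised, so two code paths that accumulate the same quantities can legitimately differ in the last bits.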
> >> >>>
> >> >>> Since the force calculation and energy calculation follow different
> >> >>> code paths, the 'order of operations' differs between the two
> >> >>> routines. As a result, the exact forces may vary a tinytinytiny bit
> >> >>> depending on whether the force or energy routine was called. This
> >> >>> difference is tiny and negligible, but since classical systems of >2
> >> >>> bodies are chaotic, these differences eventually manifest as completely
> >> >>> different trajectories.
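
To see how a last-digit difference can grow into a completely different trajectory, here is a small sketch that uses the logistic map as a stand-in chaotic system. It is not a molecular dynamics integrator and has nothing to do with pmemd's code; the function names and the 1e-15 perturbation are assumptions made for the illustration.

```python
def logistic_orbit(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x), a textbook chaotic map for r = 4."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def first_divergence(a, b, threshold):
    """Index of the first step where the two orbits differ by more than threshold."""
    for i, (x, y) in enumerate(zip(a, b)):
        if abs(x - y) > threshold:
            return i
    return None


if __name__ == "__main__":
    reference = logistic_orbit(0.3, 200)
    # Perturb the start by roughly the size of a round-off error.
    perturbed = logistic_orbit(0.3 + 1e-15, 200)
    print("initial difference:", abs(reference[0] - perturbed[0]))
    print("first step with difference > 0.1:",
          first_divergence(reference, perturbed, 0.1))
```

The two orbits agree closely at first and then separate entirely, which is the behaviour described above for runs that differ only in how often the energy routine is called.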
> >> >>>
> >> >>> As Scott said, this difference is expected, unavoidable, and
> >> >>> conveniently unimportant. (In fact, some may argue it's a _good thing_).
> >> >>>
> >> >>>
> >> >>> >
> >> >>> > I understand that evolving the system by Molecular Dynamics does not
> >> >>> > require the energy, just the forces, so the energy is calculated only
> >> >>> > when explicitly requested (i.e. every NTPR steps). What I have trouble
> >> >>> > understanding is why the instantaneous energy value E(i) printed in
> >> >>> > the mdout file at step "i" should depend on the number of my "energy
> >> >>> > requests" before the simulation reached step "i" (i.e. on the NTPR
> >> >>> > value). I naturally assume that my energy requests do not influence
> >> >>> > the evolution of my molecular system (e.g. do not influence the
> >> >>> > forces ...). I see the NTPR parameter just as the period with which
> >> >>> > some function "CALCULATE_ENERGIES" is called to calculate all the
> >> >>> > energy components of the simulated system at a given moment, that's
> >> >>> > all, but perhaps I am not right here?
> >> >>> >
> >> >>> > How exactly is the "ene_avg_sampling" parameter connected with the
> >> >>> > "NTPR" parameter?
> >> >>> >
> >> >>>
> >> >>> Like the "ntpr" parameter, the ene_avg_sampling variable tells pmemd how
> >> >>> frequently you _want_ it to calculate energies. If ene_avg_sampling is
> >> >>> set to 10, then pmemd.cuda will compute energies every 10 steps so they
> >> >>> can be averaged. If ntpr is any multiple of 10, then pmemd.cuda will
> >> >>> still compute energies _only_ every 10 steps (so that it can be
> >> >>> averaged that often). As a result, the code path is dictated by the
> >> >>> fact that ene_avg_sampling is 10 rather than by the value of ntpr.
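
The following toy model is one way to read the paragraph above. It is not pmemd's actual scheduling logic: it simply assumes that the energy code path is taken on every step that is a multiple of either ntpr or ene_avg_sampling. The function energy_steps and its n_steps argument are hypothetical names introduced for the illustration.

```python
def energy_steps(n_steps, ntpr, ene_avg_sampling):
    """Steps (1-based) at which this toy model takes the energy code path."""
    return [
        step
        for step in range(1, n_steps + 1)
        if step % ene_avg_sampling == 0 or step % ntpr == 0
    ]


if __name__ == "__main__":
    # With ene_avg_sampling = 10, any ntpr that is a multiple of 10 gives
    # the same schedule, so the same sequence of code paths is followed.
    print(energy_steps(100, ntpr=50, ene_avg_sampling=10)
          == energy_steps(100, ntpr=100, ene_avg_sampling=10))
    # An ntpr that is not a multiple of 10 adds extra energy steps.
    print(energy_steps(30, ntpr=5, ene_avg_sampling=10))
```

Under this assumption, two runs whose schedules are identical follow identical code paths, while runs with different schedules can accumulate different round-off and eventually diverge.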
> >> >>>
> >> >>> I hope this clarified things a little bit...
> >> >>>
> >> >>> Jason
> >> >>>
> >> >>> --
> >> >>> Jason M. Swails
> >> >>> Quantum Theory Project,
> >> >>> University of Florida
> >> >>> Ph.D. Candidate
> >> >>> 352-392-4032
> >> >>> _______________________________________________
> >> >>> AMBER mailing list
> >> >>> AMBER.ambermd.org
> >> >>> http://lists.ambermd.org/mailman/listinfo/amber
> >> >>>
> >> >>
> >> >>
> >>
> >>
> >> --
> >> This message was created with Opera's revolutionary e-mail client:
> >> http://www.opera.com/mail/
> >>
> >>
>
>
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu May 30 2013 - 10:00:02 PDT