Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: ET <sketchfoot.gmail.com>
Date: Thu, 30 May 2013 18:27:13 +0100

Here is the deviceQuery result:

deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX TITAN"
  CUDA Driver Version / Runtime Version 5.0 / 5.0
  CUDA Capability Major/Minor version number: 3.5
  Total amount of global memory: 6143 MBytes (6441730048 bytes)
  (14) Multiprocessors x (192) CUDA Cores/MP: 2688 CUDA Cores
  GPU Clock rate: 928 MHz (0.93 GHz)
  Memory Clock rate: 3004 Mhz
  Memory Bus Width: 384-bit
  L2 Cache Size: 1572864 bytes
  Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 65536
  Warp size: 32
  Maximum number of threads per multiprocessor: 2048
  Maximum number of threads per block: 1024
  Maximum sizes of each dimension of a block: 1024 x 1024 x 64
  Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 1 copy engine(s)
  Run time limit on kernels: No
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Device PCI Bus ID / PCI location ID: 3 / 0
  Compute Mode:
     < Exclusive Process (many threads in one process is able to use ::cudaSetDevice() with this device) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GTX TITAN

cheers,
g



On 30 May 2013 17:55, Scott Le Grand <varelse2005.gmail.com> wrote:

> I am very interested in the results of your downclocking. Let us all
> know...
> On May 30, 2013 8:09 AM, "Marek Maly" <marek.maly.ujep.cz> wrote:
>
> > OK,
> > thanks a lot for info !
> >
> > About a year ago I successfully RMAed 2 x GTX 580, but in that case
> > I could argue with records of many errors obtained using memtestG80.
> >
> > So I simply described the problem, attached the memtestG80 outputs, and
> > that was enough for them. Here (in the Titan case) the situation is not
> > so clear/transparent, but I believe that eventually I can also succeed
> > with RMAing my Titans.
> > But I am not sure what the probability is that the 2 new GPUs will be at
> > least as good as those which I RMAed. I don't have time to repeatedly
> > deal with RMA issues ...
> >
> > Anyway, my strange (not black-and-white) results from the Amber tests (as
> > well as the negative memtestG80 tests) suggest that the cards are maybe
> > not damaged and that a simple downclock to stock frequency might solve
> > the problem. 928 MHz on a Titan might still be OK for gaming, but this
> > frequency is already "risky" for reliable scientific computation,
> > especially long-term runs (days/weeks) and especially with some
> > software (here Amber).
> >
> > The idea that downclocking might help here is possibly supported by my
> > hypothesis that only the best/well-tested chips are selected for factory
> > overclocking, but maybe I am too idealistic here :))
> > It's really a pity that NVclock is "impotent" here and one has to edit
> > the GPU BIOS directly to downclock the GPU under Linux. But I would keep
> > that step as the very last resort. I am waiting with some hope for the
> > patch which Ross announced.
> >
> > BTW, can you please confirm that your overclocked Titan is also
> > automatically running at the boost frequency = 928 MHz? You can find
> > this out using the deviceQuery routine, which is part of the CUDA
> > samples, but you can also find the actual GPU frequency written in
> > the Amber mdout files.
> >
> > See here my example from one such file:
> >
> > |------------------- GPU DEVICE INFO --------------------
> > |
> > | CUDA Capable Devices Detected: 1
> > | CUDA Device ID in use: 0
> > | CUDA Device Name: GeForce GTX TITAN
> > | CUDA Device Global Mem Size: 6143 MB
> > | CUDA Device Num Multiprocessors: 14
> > | CUDA Device Core Freq: 0.93 GHz
> > |
> > |--------------------------------------------------------
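
For anyone who wants to check the frequency across many mdout files, here is a minimal Python sketch that pulls the core frequency out of the GPU DEVICE INFO text. It assumes the "CUDA Device Core Freq" line looks like the one in the excerpt above; the function name core_freq_ghz and the sample string are made up for illustration and are not part of Amber.

```python
import re

def core_freq_ghz(mdout_text):
    """Return the 'CUDA Device Core Freq' value in GHz, or None if absent."""
    match = re.search(r"CUDA Device Core Freq:\s*([0-9.]+)\s*GHz", mdout_text)
    return float(match.group(1)) if match else None

# Sample text modelled on the mdout excerpt quoted in this thread.
sample = """\
|------------------- GPU DEVICE INFO --------------------
|
|   CUDA Device Name: GeForce GTX TITAN
|   CUDA Device Core Freq:   0.93 GHz
|
|--------------------------------------------------------
"""

freq = core_freq_ghz(sample)
print(freq)
```

In practice one would read the text from a real mdout file (for example with open(path).read()) instead of the sample string.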
> >
> >
> > Thanks in advance !
> >
> > M.
> >
> >
> >
> > On Thu, 30 May 2013 16:06:17 +0200 ET <sketchfoot.gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I don't think it's particularly lucky. :) The evidence pointed clearly
> > > to the hardware being faulty IMO. I RMA'd approx. three weeks after
> > > purchase, so I was out of my 7-day period (UK) where I can return it if
> > > I don't like the color. Where did you get your card from? Is it harder
> > > to get an RMA in the country where you are based? I have heard (don't
> > > know how true it is) that it is harder to do this in the States?
> > >
> > > I don't imagine they did anything more than run Heaven and Valley
> > > benchmarks. If it was a manufacturer-supplied test, then the
> > > manufacturer would have caught it before it was sent out for sale, and
> > > I can't imagine a store developing their own test, though I may be
> > > wrong on that.
> > >
> > > I will post my benchmark results asap, though it may be tomorrow.
> > >
> > > I hope you get your card sorted out too! :)
> > >
> > > FYI: My RMA request was as follows:
> > >
> > >
> > > #######################################
> > > My setup is as follows: i7-930 Intel quad-core CPU, 6GB RAM on a
> > > Gigabyte GA-C58-UD7 motherboard. I have two NVIDIA GPUs installed: 1x
> > > EVGA superclocked Geforce Titan and the other (the one that I wish to
> > > return), which is a standard (not overclocked) EVGA Geforce Titan. I'm
> > > not running an SLI setup and use the GPUs for running biophysical
> > > simulations. The system runs headless without any GUI and thus no
> > > display. This makes it a pure compute card, and thus any errors are
> > > related to this rather than display misconfigurations.
> > >
> > > I have had the superclocked Geforce for a longer time and have been
> > > benchmarking it against a standard test simulation without any issues.
> > > On receiving the standard Geforce, I realised that it was crashing
> > > catastrophically (after 10-15 mins) whilst running the same benchmark
> > > that the other card did not have a problem with.
> > >
> > > I verified that this card was faulty by swapping the cards around so
> > > they occupied their partner's PCI-e slot (so still in a dual GPU
> > > configuration). The problem persisted. So I took the superclocked card
> > > out and tested the card on its own in first one PCI-e slot, then the
> > > other. As the problem has not gone away and the other card tested did
> > > not have a problem with the benchmark, my conclusion is that the
> > > standard Geforce is faulty.
> > >
> > > I would like to return the card for a replacement. If it is at all
> > > possible, could I get another superclocked EVGA? I am happy to pay the
> > > price difference.
> > > ###########################################
> > >
> > > br,
> > > g
> > >
> > >
> > > On 30 May 2013 13:32, Marek Maly <marek.maly.ujep.cz> wrote:
> > >
> > >> Lucky guy ! :))
> > >>
> > >> I am just curious what your original justification was
> > >> for the RMA of that Titan. How did you argue here? Just
> > >> using the Amber calculation instability, or did you also find
> > >> some errors during other common tests like memtestG80,
> > >> cuda_memtest, gpu_burn and/or some common Win performance testers
> > >> (Heaven, 3DMark ...)?
> > >>
> > >> It would be nice to know the name of the test which the returns
> > >> technicians used and which clearly and undoubtedly proved that the
> > >> given GPU is defective.
> > >>
> > >> How long after purchase did you RMA this card?
> > >>
> > >> I am also curious about your Amber reproducibility benchmark tests.
> > >> Now I am doing 500k-step ones with the updated driver 319.23, and for
> > >> the moment it does not seem that the driver update solved the
> > >> problems :((
> > >>
> > >> Marek
> > >>
> > >>
> > >>
> > >>
> > >> On Thu, 30 May 2013 14:08:18 +0200 ET <sketchfoot.gmail.com> wrote:
> > >>
> > >> > An update:
> > >> >
> > >> > Just got a mail from ebuyer who said:
> > >> >
> > >> > Following extensive tests by our returns technicians, this item was
> > >> > found to be faulty. A replacement product will be dispatched as soon
> > >> > as the RMA is closed.
> > >> >
> > >> > For more details check the My Orders section of www.ebuyer.com
> > >> >
> > >> > Kind regards,
> > >> >
> > >> > Ebuyer Customer Support
> > >> >
> > >> >
> > >> >
> > >> > On 30 May 2013 09:33, ET <sketchfoot.gmail.com> wrote:
> > >> >
> > >> >> Hi,
> > >> >>
> > >> >> I believe this was the specific driver I used:
> > >> >>
> > >> >>
> http://www.nvidia.com/object/linux-display-amd64-313.30-driver.html
> > >> >>
> > >> >>
> > >> >> I'm running the benchmark now on the super-duper-clocked geforce
> > >> >> that I believe is "working". I can't do it on the other Titan as
> > >> >> I've RMA'd it. Dunno how long it will take as my CPU is only a
> > >> >> quad core i7 :(
> > >> >>
> > >> >> Will post my results back when done.
> > >> >>
> > >> >> br,
> > >> >> g
> > >> >>
> > >> >>
> > >> >>
> > >> >>
> > >> >> On 30 May 2013 03:42, Jason Swails <jason.swails.gmail.com> wrote:
> > >> >>
> > >> >>> On Wed, May 29, 2013 at 6:00 PM, Marek Maly <marek.maly.ujep.cz>
> > >> wrote:
> > >> >>>
> > >> >>> > Hi Jason,
> > >> >>> >
> > >> >>> > thanks for the explanation but to be frank I did not understand
> > >> >>> > the main idea.
> > >> >>> >
> > >> >>>
> > >> >>> I'll try to explain a little bit (but perhaps it's better to just
> > >> >>> take Scott's advice and trust him on that). The problem is that
> > >> >>> addition of floating point numbers in computers is not strictly
> > >> >>> associative. That is, a + (b + c) != (a + b) + c, due to round-off
> > >> >>> issues in the last decimal place or so. As a result, the numerical
> > >> >>> result of a summation on a computer depends on the _order_ in which
> > >> >>> those numbers are added. If you change the 'order of operations,'
> > >> >>> then you risk changing the exact value of the result in the last
> > >> >>> stored decimal. See the wikipedia page on Floating point accuracy:
> > >> >>> https://en.wikipedia.org/wiki/Floating_point#Accuracy_problems
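
A quick way to see the non-associativity described above is the following minimal Python sketch. It uses plain IEEE 754 doubles with values chosen purely for illustration; the helper naive_sum is a made-up name, and nothing here is Amber code.

```python
# Floating-point addition is not associative: grouping changes the rounding.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)   # False
print(left, right)     # 0.6000000000000001 0.6

def naive_sum(values):
    """Add the values strictly left to right, with no error compensation."""
    total = 0.0
    for v in values:
        total += v
    return total

# Same three numbers, different order of accumulation.
forward = naive_sum([1e16, 1.0, -1e16])    # the 1.0 is rounded away next to 1e16
reordered = naive_sum([1e16, -1e16, 1.0])  # the large terms cancel first
print(forward, reordered)  # 0.0 1.0
```

A plain loop is used instead of the built-in sum() so that the order of additions is explicit; a GPU reduction that adds the same terms in a different order is in the same situation as the reordered list.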
> > >> >>>
> > >> >>> Since the force calculation and energy calculation follow different
> > >> >>> code paths, the 'order of operations' differs between the two
> > >> >>> routines. As a result, the exact forces may vary a tinytinytiny bit
> > >> >>> depending on whether the force or energy routine was called. This
> > >> >>> difference is tiny and negligible, but since classical systems of >2
> > >> >>> bodies are chaotic these differences eventually manifest as
> > >> >>> completely different trajectories.
> > >> >>>
> > >> >>> As Scott said, this difference is expected, unavoidable, and
> > >> >>> conveniently unimportant. (In fact, some may argue it's a
> > >> >>> _good thing_).
> > >> >>>
> > >> >>>
> > >> >>> >
> > >> >>> > I understand that for system evolution by Molecular Dynamics it is
> > >> >>> > not necessary to calculate the energy, just the forces, and so the
> > >> >>> > energy is calculated only when explicitly requested (i.e. with the
> > >> >>> > NTPR step period). But what I have a problem understanding is why
> > >> >>> > the instantaneous energy value E(i) printed (in the mdout file) at
> > >> >>> > step "i" should depend on the number of my "energy requests"
> > >> >>> > before the simulation reached step "i" (i.e. on the NTPR value).
> > >> >>> > I naturally assume that my energy requests do not influence the
> > >> >>> > evolution of my molecular system by Molecular Dynamics (e.g. do
> > >> >>> > not influence the forces ...). I see the NTPR parameter just as
> > >> >>> > the period in which some function "CALCULATE_ENERGIES" is called
> > >> >>> > to calculate all the energy components of the simulated system at
> > >> >>> > a given moment, that's all, but perhaps I am not right here?
> > >> >>> >
> > >> >>> > How exactly is the "ene_avg_sampling" parameter connected with the
> > >> >>> > "NTPR" parameter?
> > >> >>> >
> > >> >>>
> > >> >>> Like the "ntpr" parameter, the ene_avg_sampling variable tells pmemd
> > >> >>> how frequently you _want_ it to calculate energies. If
> > >> >>> ene_avg_sampling is set to 10, then pmemd.cuda will compute energies
> > >> >>> every 10 steps so they can be averaged. If ntpr is any multiple of
> > >> >>> 10, then pmemd.cuda will still compute energies _only_ every 10
> > >> >>> steps (so that it can be averaged that often). As a result, the code
> > >> >>> path is dictated by the fact that ene_avg_sampling is 10 rather than
> > >> >>> by the value of ntpr.
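
The relationship described above can be mimicked with a toy model in Python. This is only a sketch of the description in this thread, not pmemd source code: it assumes energies are evaluated on any step that is a multiple of either ntpr or ene_avg_sampling, and the function name energy_steps is hypothetical.

```python
def energy_steps(nstlim, ntpr, ene_avg_sampling):
    """Steps at which energies would be evaluated in this toy model."""
    return {
        step
        for step in range(1, nstlim + 1)
        if step % ene_avg_sampling == 0 or step % ntpr == 0
    }

# With ene_avg_sampling = 10, any ntpr that is a multiple of 10 gives the
# same set of energy-evaluation steps, hence the same sequence of code paths.
base = energy_steps(1000, ntpr=10, ene_avg_sampling=10)
results = {}
for ntpr in (10, 50, 100, 500):
    results[ntpr] = energy_steps(1000, ntpr=ntpr, ene_avg_sampling=10) == base
    print(ntpr, results[ntpr])

# If the sampling period is as long as the run, ntpr decides instead.
sparse_a = energy_steps(1000, ntpr=100, ene_avg_sampling=1000)
sparse_b = energy_steps(1000, ntpr=10, ene_avg_sampling=1000)
print(sparse_a == sparse_b)
```

Under this assumption, changing ntpr among multiples of ene_avg_sampling leaves the set of energy steps unchanged, which matches the statement that the code path is dictated by ene_avg_sampling rather than by ntpr.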
> > >> >>>
> > >> >>> I hope this clarified things a little bit...
> > >> >>>
> > >> >>> Jason
> > >> >>>
> > >> >>> --
> > >> >>> Jason M. Swails
> > >> >>> Quantum Theory Project,
> > >> >>> University of Florida
> > >> >>> Ph.D. Candidate
> > >> >>> 352-392-4032
> > >> >>> _______________________________________________
> > >> >>> AMBER mailing list
> > >> >>> AMBER.ambermd.org
> > >> >>> http://lists.ambermd.org/mailman/listinfo/amber
> > >> >>>
> > >> >>
> > >> >>
> > >>
> > >>
> > >> --
> > >> This message was created with Opera's revolutionary e-mail client:
> > >> http://www.opera.com/mail/
> > >>
> >
> >
Received on Thu May 30 2013 - 10:30:02 PDT