Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Scott Le Grand <varelse2005.gmail.com>
Date: Tue, 28 May 2013 12:01:20 -0700

And that's exactly the way to handle this IMO! Life is indeed too short.



On Tue, May 28, 2013 at 11:42 AM, ET <sketchfoot.gmail.com> wrote:

> Hi,
>
> I just got a superclocked Titan and one at the normal frequency. The first
> one ran like a charm with no issues so far. The other, standard-clocked one
> could never get past the constant-pressure stage in an NPT simulation. It
> kept writing NaN or ********* in the output file. I swapped them about in
> the PCIe slots, then ran each solo in each of the slots. Despite all this
> it was still failing the benchmark that the other one had no problems with.
>
> I couldn't find any memory errors with gpu_burn either, but as they cost
> near a grand apiece, I RMA'd it today. I recommend you do the same if it's
> not giving you any joy. Life's too short. :)
>
> br,
> g
>
>
> On 28 May 2013 16:57, Scott Le Grand <varelse2005.gmail.com> wrote:
>
> > AMBER != NAMD...
> >
> > GTX 680 != GTX Titan...
> >
> > Ian's suggestion is a good one. But even then, you need to test your
> > GPUs, as the Titans are running right on the edge of stability. Like I
> > told Marek, try running 100K iterations of Cellulose NVE twice with the
> > same random seed. If you don't get bit-for-bit identical output, your
> > GPU is not working. Memtest programs do not catch this because (I am
> > guessing) they are designed for a uniform memory hierarchy with only one
> > path to read and write data. I have a stock GTX Titan that cannot pass
> > the Cellulose NVE test and another one that can. I spent a couple of
> > days on the former GPU looking for the imaginary bug that went away like
> > magic the second I switched out the GPU.
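> >
> > As a rough illustration of that test (just a sketch, not part of the
> > AMBER distribution; the input/output file names below are assumptions
> > based on a typical benchmark directory), a small Python script can
> > launch the same run twice and compare the trajectory and restart files
> > byte for byte:
> >
> > ----------------------------------------------------------------------
> > #!/usr/bin/env python
> > # determinism_check.py -- run the same pmemd.cuda job twice and verify
> > # that its outputs are bit-for-bit identical. File names are hypothetical.
> > import filecmp
> > import subprocess
> >
> > CMD = ["pmemd.cuda", "-O",
> >        "-i", "mdin",      # input file with ig set to a fixed random seed
> >        "-p", "prmtop",
> >        "-c", "inpcrd"]
> >
> > def run(tag):
> >     # Run one copy of the job, tagging its output files.
> >     subprocess.check_call(CMD + ["-o", "mdout." + tag,
> >                                  "-x", "mdcrd." + tag,
> >                                  "-r", "restrt." + tag])
> >
> > run("a")
> > run("b")
> >
> > # shallow=False forces a byte-by-byte content comparison.
> > for f in ("mdcrd", "restrt"):
> >     same = filecmp.cmp(f + ".a", f + ".b", shallow=False)
> >     print("%s: %s" % (f, "identical" if same else "MISMATCH - suspect GPU"))
> > ----------------------------------------------------------------------
> >
> > (The mdout files are skipped on purpose, since they contain timing
> > information that legitimately differs between runs.)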
> >
> > Scott
> >
> >
> >
> >
> >
> > On Tue, May 28, 2013 at 8:11 AM, Robert Konecny <rok.ucsd.edu> wrote:
> >
> > > Hi Scott,
> > >
> > > Unfortunately, we are seeing Amber instability on GTX Titans similar to
> > > what Marek reports. We have a box with four GTX Titans (not overclocked)
> > > running CentOS 6.3 with the NVIDIA 319.17 driver and Amber 12.2. Any
> > > Amber simulation longer than 10-15 min eventually crashes on these
> > > cards, including both JAC benchmarks (with extended run time). This is
> > > reproducible on all four cards.
> > >
> > > To rule out a hardware error we ran extended GPU memory tests on all
> > > four Titans with memtestG80, cuda_memtest and also gpu_burn - all
> > > finished without errors. Since I agree that these programs may not test
> > > the GPU completely, we also set up simulations with NAMD. We can run
> > > four NAMD simulations simultaneously for many days without any errors
> > > on this hardware. For reference, we also have exactly the same server
> > > with the same hardware components but with four GTX 680s, and that
> > > setup works just fine for Amber. So all this leads me to believe that a
> > > hardware error is not very likely.
> > >
> > > I would appreciate your comments on this; perhaps there is something
> > > else causing these errors that we are not seeing.
> > >
> > > Thanks,
> > >
> > > Robert
> > >
> > >
> > > On Mon, May 27, 2013 at 04:25:24PM -0700, Scott Le Grand wrote:
> > > > I have two GTX Titans. One is defective, the other is not.
> > > > Unfortunately, they both pass all standard GPU memory tests.
> > > >
> > > > What the defective one doesn't do is generate reproducibly
> > > > bit-accurate outputs for simulations of Factor IX (90,986 atoms) or
> > > > larger, of 100K or so iterations.
> > > >
> > > > Which is yet another reason why I insist on MD algorithms (especially
> > > > on GPUs) being deterministic. Besides helping to find software bugs
> > > > and fulfilling one of the most important tenets of science, it's a
> > > > great way to diagnose defective hardware with very little effort.
> > > >
> > > > 928 MHz? That's 6% above the boost clock of a stock Titan. Titan is
> > > > pushing the performance envelope as it is. If you're going to pay the
> > > > premium for such chips, I'd send them back until you get one that
> > > > runs correctly. I'm very curious how fast you can push one of these
> > > > things before it gives out.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Mon, May 27, 2013 at 10:01 AM, Marek Maly <marek.maly.ujep.cz> wrote:
> > > >
> > > > > Dear all,
> > > > >
> > > > > I have recently bought two "EVGA GTX TITAN Superclocked" GPUs.
> > > > >
> > > > > I did my first calculations (pmemd.cuda in Amber12) with systems of
> > > > > around 60K atoms without any problems (NPT, Langevin), but when I
> > > > > later tried bigger systems (around 100K atoms) I obtained the
> > > > > classic, irritating error
> > > > >
> > > > > cudaMemcpy GpuBuffer::Download failed unspecified launch failure
> > > > >
> > > > > after just a few thousand MD steps.
> > > > >
> > > > > This was obviously the reason for running the memtestG80 tests
> > > > > ( https://simtk.org/home/memtest ).
> > > > >
> > > > > I compiled memtestG80 from source (memtestG80-1.1-src.tar.gz) and
> > > > > then tested just a small part of the GPU memory (200 MiB) using 100
> > > > > iterations.
> > > > >
> > > > > On both cards I obtained a huge number of errors, but "just" in the
> > > > > "Random blocks" test; 0 errors in all remaining tests in all
> > > > > iterations.
> > > > >
> > > > > ------THE LAST ITERATION AND FINAL RESULTS-------
> > > > >
> > > > > Test iteration 100 (GPU 0, 200 MiB): 169736847 errors so far
> > > > > Moving Inversions (ones and zeros): 0 errors (6 ms)
> > > > > Memtest86 Walking 8-bit: 0 errors (53 ms)
> > > > > True Walking zeros (8-bit): 0 errors (26 ms)
> > > > > True Walking ones (8-bit): 0 errors (26 ms)
> > > > > Moving Inversions (random): 0 errors (6 ms)
> > > > > Memtest86 Walking zeros (32-bit): 0 errors (105 ms)
> > > > > Memtest86 Walking ones (32-bit): 0 errors (104 ms)
> > > > > Random blocks: 1369863 errors (27 ms)
> > > > > Memtest86 Modulo-20: 0 errors (215 ms)
> > > > > Logic (one iteration): 0 errors (4 ms)
> > > > > Logic (4 iterations): 0 errors (8 ms)
> > > > > Logic (shared memory, one iteration): 0 errors (8 ms)
> > > > > Logic (shared-memory, 4 iterations): 0 errors (25 ms)
> > > > >
> > > > > Final error count after 100 iterations over 200 MiB of GPU memory:
> > > > > 171106710 errors
> > > > >
> > > > > ------------------------------------------
> > > > >
> > > > > I have some questions and would be really grateful for any
> > > > > comments.
> > > > >
> > > > > Regarding overclocking: using deviceQuery I found that under Linux
> > > > > both cards automatically run at the boost shader/GPU frequency,
> > > > > which here is 928 MHz (the base value for these factory-overclocked
> > > > > cards is 876 MHz). deviceQuery reported a "Memory Clock rate" of
> > > > > 3004 MHz, although it should be 6008 MHz, but maybe the quantity
> > > > > reported by deviceQuery as "Memory Clock rate" is different from
> > > > > the "Memory Clock" in the product specification. It seems that
> > > > > "Memory Clock rate" = "Memory Clock"/2. Am I right? Or is
> > > > > deviceQuery simply unable to read this spec properly on the Titan
> > > > > GPU?
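> > > > >
> > > > > To make the arithmetic I have in mind explicit (just my own
> > > > > back-of-the-envelope check in Python; that deviceQuery reports the
> > > > > physical DDR clock while the spec sheet quotes the doubled,
> > > > > effective data rate is exactly the assumption I am asking about):
> > > > >
> > > > > ---------------------------------------------------------------
> > > > > # Assumed relation between the two numbers I am seeing:
> > > > > reported_mhz  = 3004              # deviceQuery "Memory Clock rate"
> > > > > effective_mhz = 2 * reported_mhz  # doubled (DDR) data rate
> > > > > print(effective_mhz)              # 6008 -> the spec sheet value
> > > > > ---------------------------------------------------------------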
> > > > >
> > > > > Anyway, for the moment I assume that the problem might be due to
> > > > > the high shader/GPU frequency (see here:
> > > > > http://folding.stanford.edu/English/DownloadUtils ).
> > > > >
> > > > > To verify this hypothesis, one should perhaps UNDERclock to the base
> > > > > frequency, which on this model is 876 MHz, or even to the Titan
> > > > > reference frequency, which is 837 MHz.
> > > > >
> > > > > Obviously I am working with these cards under Linux (CentOS,
> > > > > 2.6.32-358.6.1.el6.x86_64) and, as I found, the overclocking tools
> > > > > under Linux are in fact limited to the NVclock utility, which is
> > > > > unfortunately out of date (at least as far as the GTX Titan is
> > > > > concerned). I obtained the following output when I simply asked
> > > > > NVclock to read and print the shader and memory frequencies of my
> > > > > Titans:
> > > > >
> > > > > -------------------------------------------------------------------
> > > > >
> > > > > [root.dyn-138-272 NVCLOCK]# nvclock -s --speeds
> > > > > Card: Unknown Nvidia card
> > > > > Card number: 1
> > > > > Memory clock: -2147483.750 MHz
> > > > > GPU clock: -2147483.750 MHz
> > > > >
> > > > > Card: Unknown Nvidia card
> > > > > Card number: 2
> > > > > Memory clock: -2147483.750 MHz
> > > > > GPU clock: -2147483.750 MHz
> > > > >
> > > > >
> > > > > -------------------------------------------------------------------
> > > > >
> > > > >
> > > > > I would be really grateful for some tips regarding NVclock
> > > > > alternatives, but after wasting some hours on googling it seems
> > > > > that there is no other Linux tool with NVclock's functionality. So
> > > > > the only remaining possibility is perhaps to edit the GPU BIOS with
> > > > > tools like Kepler BIOS Tweaker or NVflash (Linux/DOS/Windows), but
> > > > > I would rather avoid such an approach, as using it probably also
> > > > > voids the warranty, even though I am going to underclock the GPUs,
> > > > > not overclock them.
> > > > > So before this eventual step (GPU BIOS editing) I would like to
> > > > > have some approximate estimate of the probability that the problems
> > > > > here are really caused by the overclocking (too high a default
> > > > > boost shader frequency).
> > > > >
> > > > > I hope to estimate this probability from the responses of other
> > > > > Amber/Titan SC users, if I am not the only crazy guy who bought
> > > > > this model for Amber calculations :)) But any experiences with
> > > > > Titan cards related to their memtestG80 results and
> > > > > UNDER/OVERclocking (if possible under Linux) are of course welcome
> > > > > as well!
> > > > >
> > > > > My HW/SW configuration:
> > > > >
> > > > > motherboard: ASUS P9X79 PRO
> > > > > CPU: Intel Core i7-3930K
> > > > > RAM: CRUCIAL Ballistix Sport 32GB (4x8GB) DDR3 1600 VLP
> > > > > CASE: CoolerMaster Dominator CM-690 II Advanced,
> > > > > Power: Enermax PLATIMAX EPM1200EWT 1200W, 80+ Platinum
> > > > > GPUs : 2 x EVGA GTX TITAN Superclocked 6GB
> > > > > cooler: Cooler Master Hyper 412 SLIM
> > > > >
> > > > > OS: CentOS (2.6.32-358.6.1.el6.x86_64)
> > > > > driver version: 319.17
> > > > > cudatoolkit_5.0.35_linux_64_rhel6.x
> > > > >
> > > > > The computer is in an air-conditioned room with a constant ambient
> > > > > temperature of around 18°C.
> > > > >
> > > > >
> > > > > Thanks a lot in advance for any comment/experience !
> > > > >
> > > > > Best wishes,
> > > > >
> > > > > Marek
> > > > >
> > > > > --
> > > > > This message was composed with Opera's revolutionary e-mail client:
> > > > > http://www.opera.com/mail/
> > > > >
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue May 28 2013 - 12:30:03 PDT