Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Scott Le Grand <varelse2005.gmail.com>
Date: Tue, 28 May 2013 13:13:36 -0700

Marek,
Your GPU is hosed. I don't have anything else to add. I'm not going to go
snark hunting for a bug that doesn't exist.



On Tue, May 28, 2013 at 12:24 PM, Marek Maly <marek.maly.ujep.cz> wrote:

> Hi, just out of curiosity, which driver are you using
> on the machine where the OC TITAN works perfectly,
> 319.17 or something more recent, e.g. 319.23?
>
> RMA is a good idea, but it can also turn into a long story, and
> to succeed you need strong arguments,
> especially if you are going to RMA two OC TITANs.
>
> I am not sure whether my argument "the cards have problems with some
> Amber calculations" would be strong enough here. It would be much better
> to have clear results from respected GPU tests, and as it seems, you can
> run extensive GPU tests with multiple routines without any errors and
> still have problems with particular Amber simulations...
>
> BTW I am now running the Amber benchmarks with nstlim=100K and the
> default ig on each card, twice per card. The tests should be done in
> about 3 hours (due to the slow nucleosome GB test).
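>
> For clarity, the only intentional change relative to the stock benchmark
> inputs is roughly the following (a sketch; 71277 is Amber's default fixed
> seed, which is what makes repeat runs comparable):
>
>   &cntrl
>     ...
>     nstlim = 100000,  ! 100K steps instead of the stock short runs
>     ig     = 71277,   ! the default fixed seed, stated explicitly
>     ...
>   /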
>
> But even now I have interesting results from the first test on GPU0
> (nucleosome is still running) see below.
>
> As you can see, JAC_NPT crashed around step 11000; here is the last
> md.out record:
>
> *********
>
> ------------------------------------------------------------------------------
>
> check COM velocity, temp: 0.000021 0.00(Removed)
>
> NSTEP = 11000 TIME(PS) = 28.000 TEMP(K) = 300.39 PRESS = -9.4
> Etot = -58092.8958 EKtot = 14440.2520 EPtot = -72533.1478
> BOND = 443.3912 ANGLE = 1253.5177 DIHED = 970.1275
> 1-4 NB = 567.2497 1-4 EEL = 6586.9007 VDWAALS = 8664.9960
> EELEC = -91019.3306 EHBOND = 0.0000 RESTRAINT = 0.0000
> EKCMT = 6274.0354 VIRIAL = 6321.9969 VOLUME = 236141.9494
> Density = 1.0162
>
> ------------------------------------------------------------------------------
>
> | ERROR: max pairlist cutoff must be less than unit cell max sphere radius!
>
> ********
>
> Any idea about that ERROR?
>
> On the other hand, FACTOR_IX_NPT, which has many more atoms, passed
> without any issue.
>
> Cellulose crashed right at the beginning, without any ERROR message in
> the md.out file.
>
>
> I am very curious about the exact reproducibility of the results, at
> least between the two runs on each individual card (a quick check is
> sketched below).
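>
> (For illustration, assuming trajectory files named per card and run,
> something like
>
>   cmp mdcrd.gpu0.run1 mdcrd.gpu0.run2
>   cmp mdcrd.gpu1.run1 mdcrd.gpu1.run2
>
> should report byte-identical files on healthy cards; the file names here
> are hypothetical.)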
>
> BTW, regarding eventual downclocking: does anyone know of an NVclock
> alternative, or will I really be forced to edit the frequency value in
> the GPU BIOS?
>
> Best,
>
> Marek
>
> HERE ARE THE FIRST DATA FROM MY 2x2 BENCHMARK TESTS
>
> JAC_PRODUCTION_NVE - 23,558 atoms PME
> -------------------------------------
>
> 1 x GTX_TITAN: | ns/day = 115.91 seconds/ns = 745.39
>
> JAC_PRODUCTION_NPT - 23,558 atoms PME
> -------------------------------------
>
> 1 x GTX_TITAN: STOP PMEMD Terminated Abnormally!
> | ns/day = 90.72 seconds/ns = 952.42
>
> FACTOR_IX_PRODUCTION_NVE - 90,906 atoms PME
> -------------------------------------------
>
> 1 x GTX_TITAN: | ns/day = 30.56 seconds/ns = 2827.33
>
> FACTOR_IX_PRODUCTION_NPT - 90,906 atoms PME
> -------------------------------------------
>
> 1 x GTX_TITAN: | ns/day = 25.01 seconds/ns = 3454.56
>
> CELLULOSE_PRODUCTION_NVE - 408,609 atoms PME
> --------------------------------------------
>
> 1 x GTX_TITAN: Error: unspecified launch failure launching kernel
> kNLSkinTest
> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
> grep: mdinfo.1GTX_TITAN: No such file or directory
>
> TRPCAGE_PRODUCTION - 304 atoms GB
> ---------------------------------
> 1 x GTX_TITAN: | ns/day = 595.09 seconds/ns = 145.19
>
> MYOGLOBIN_PRODUCTION - 2,492 atoms GB
> -------------------------------------
>
> 1 x GTX_TITAN: | ns/day = 202.56 seconds/ns = 426.53
>
> NUCLEOSOME_PRODUCTION - 25,095 atoms GB
> ---------------------------------------
>
> 1 x GTX_TITAN: (still running at the time of writing)
>
> On Tue, 28 May 2013 20:42:32 +0200, ET <sketchfoot.gmail.com> wrote:
>
> > Hi,
> >
> > I just got one superclocked Titan and one at the normal frequency. The
> > first ran like a charm with no issues so far. The standard-clocked one
> > could never get past the constant-pressure stage in an NPT simulation.
> > It kept writing NaN or ********* in the outfile. I swapped the cards
> > between PCIe slots, then ran the failing card solo in each slot.
> > Despite all this it was still failing the benchmark that the other one
> > had no problems with.
> >
> > I couldn't find any memory errors with GPU-burn either, but as they
> > cost near a grand a piece, I RMA'd it today. I recommend you do the
> > same if it's not giving you any joy. Life's too short. :)
> >
> > br,
> > g
> >
> >
> > On 28 May 2013 16:57, Scott Le Grand <varelse2005.gmail.com> wrote:
> >
> >> AMBER != NAMD...
> >>
> >> GTX 680 != GTX Titan...
> >>
> >> Ian's suggestion is a good one. But even then, you need to test your
> >> GPUs, as the Titans are running right on the edge of stability. Like
> >> I told Marek, try running 100K iterations of Cellulose NVE twice with
> >> the same random seed. If you don't get bit-identical output, your GPU
> >> is not working. Memtest programs do not catch this because (I am
> >> guessing) they are designed for a uniform memory hierarchy with only
> >> one path to read and write data. I have a stock GTX Titan that cannot
> >> pass the Cellulose NVE test and another one that does. I spent a
> >> couple of days on the former GPU looking for an imaginary bug that
> >> went away like magic the second I switched out the GPU.
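> >>
> >> A minimal sketch of that test (file names are illustrative; ig is
> >> simply left at its fixed default so both runs use the same seed):
> >>
> >>   # run the identical Cellulose NVE job twice
> >>   for run in 1 2; do
> >>     $AMBERHOME/bin/pmemd.cuda -O -i mdin -p prmtop -c inpcrd \
> >>         -o mdout.run$run -x mdcrd.run$run -r restrt.run$run
> >>   done
> >>   # mdout differs in timings, so compare the trajectories bit for bit
> >>   cmp mdcrd.run1 mdcrd.run2 && echo "bit-accurate" || echo "MISMATCH"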
> >>
> >> Scott
> >>
> >> On Tue, May 28, 2013 at 8:11 AM, Robert Konecny <rok.ucsd.edu> wrote:
> >>
> >> > Hi Scott,
> >> >
> >> > unfortunately we are seeing similar Amber instability on GTX Titans
> >> > as Marek is. We have a box with four GTX Titans (not overclocked)
> >> > running CentOS 6.3 with the NVIDIA 319.17 driver and Amber 12.2. Any
> >> > Amber simulation longer than 10-15 min eventually crashes on these
> >> > cards, including both JAC benchmarks (with extended run time). This
> >> > is reproducible on all four cards.
> >> >
> >> > To eliminate a possible hardware error we ran extended GPU memory
> >> > tests on all four Titans with memtestG80, cuda_memtest and also
> >> > gpu_burn; all finished without errors. Since I agree that these
> >> > programs may not test the GPU completely, we also set up simulations
> >> > with NAMD. We can run four NAMD simulations simultaneously for many
> >> > days without any errors on this hardware. For reference, we also
> >> > have exactly the same server with the same hardware components but
> >> > with four GTX 680s, and that setup works just fine for Amber. So all
> >> > this leads me to believe that a hardware error is not very likely.
> >> >
> >> > I would appreciate your comments on this; perhaps there is something
> >> > else causing these errors which we are not seeing.
> >> >
> >> > Thanks,
> >> >
> >> > Robert
> >> >
> >> >
> >> > On Mon, May 27, 2013 at 04:25:24PM -0700, Scott Le Grand wrote:
> >> > > I have two GTX Titans. One is defective, the other is not.
> >> > > Unfortunately, they both pass all standard GPU memory tests.
> >> > >
> >> > > What the defective one doesn't do is generate reproducibly
> >> > > bit-accurate output for simulations of Factor IX (90,906 atoms) or
> >> > > larger, over 100K or so iterations.
> >> > >
> >> > > Which is yet another reason why I insist on MD algorithms
> >> > > (especially on GPUs) being deterministic. Besides helping to find
> >> > > software bugs, and fulfilling one of the most important tenets of
> >> > > science, determinism is a great way to diagnose defective hardware
> >> > > with very little effort.
> >> > >
> >> > > 928 MHz? That's 6% above the boost clock of a stock Titan. Titan
> >> > > is pushing the performance envelope as is. If you're going to pay
> >> > > the premium for such chips, I'd send them back until you get one
> >> > > that runs correctly. I'm very curious how fast you can push one of
> >> > > these things before they give out.
> >> > >
> >> > > On Mon, May 27, 2013 at 10:01 AM, Marek Maly <marek.maly.ujep.cz>
> >> wrote:
> >> > >
> >> > > > Dear all,
> >> > > >
> >> > > > I have recently bought two "EVGA GTX TITAN Superclocked" GPUs.
> >> > > >
> >> > > > I did my first calculations (pmemd.cuda from Amber12) on systems
> >> > > > of around 60K atoms without any problems (NPT, Langevin), but
> >> > > > when I later tried bigger systems (around 100K atoms) I obtained
> >> > > > the "classical" irritating error
> >> > > >
> >> > > > cudaMemcpy GpuBuffer::Download failed unspecified launch failure
> >> > > >
> >> > > > after just a few thousand MD steps.
> >> > > >
> >> > > > So this was obviously the reason for the memtestG80 tests
> >> > > > ( https://simtk.org/home/memtest ).
> >> > > >
> >> > > > I compiled memtestG80 from source (memtestG80-1.1-src.tar.gz) and
> >> > > > then tested just a small part of the GPU memory (200 MiB) using
> >> > > > 100 iterations.
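> >> > > >
> >> > > > (If I remember the usage correctly, the invocation was something
> >> > > > like
> >> > > >
> >> > > >   ./memtestG80 --gpu 0 200 100
> >> > > >
> >> > > > i.e. test 200 MiB on GPU 0 for 100 test iterations.)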
> >> > > >
> >> > > > On both cards I obtained a huge number of errors, but only in
> >> > > > the "Random blocks" test: 0 errors in all remaining tests across
> >> > > > all iterations.
> >> > > >
> >> > > > ------THE LAST ITERATION AND FINAL RESULTS-------
> >> > > >
> >> > > > Test iteration 100 (GPU 0, 200 MiB): 169736847 errors so far
> >> > > > Moving Inversions (ones and zeros): 0 errors (6 ms)
> >> > > > Memtest86 Walking 8-bit: 0 errors (53 ms)
> >> > > > True Walking zeros (8-bit): 0 errors (26 ms)
> >> > > > True Walking ones (8-bit): 0 errors (26 ms)
> >> > > > Moving Inversions (random): 0 errors (6 ms)
> >> > > > Memtest86 Walking zeros (32-bit): 0 errors (105 ms)
> >> > > > Memtest86 Walking ones (32-bit): 0 errors (104 ms)
> >> > > > Random blocks: 1369863 errors (27 ms)
> >> > > > Memtest86 Modulo-20: 0 errors (215 ms)
> >> > > > Logic (one iteration): 0 errors (4 ms)
> >> > > > Logic (4 iterations): 0 errors (8 ms)
> >> > > > Logic (shared memory, one iteration): 0 errors (8 ms)
> >> > > > Logic (shared-memory, 4 iterations): 0 errors (25 ms)
> >> > > >
> >> > > > Final error count after 100 iterations over 200 MiB of GPU memory:
> >> > > > 171106710 errors
> >> > > >
> >> > > > ------------------------------------------
> >> > > >
> >> > > > I have some questions and would be really grateful for any comments.
> >> > > >
> >> > > > Regarding overclocking: using deviceQuery I found out that under
> >> > > > Linux both cards automatically run at the boost shader/GPU
> >> > > > frequency, which here is 928 MHz (the base value for these
> >> > > > factory-OC cards is 876 MHz). deviceQuery reported a "Memory
> >> > > > Clock rate" of 3004 MHz although "it" should be 6008 MHz, but
> >> > > > perhaps deviceQuery reports the physical clock while the product
> >> > > > specification quotes the effective double-data-rate figure, i.e.
> >> > > > "Memory Clock" = 2 x "Memory Clock rate" = 2 x 3004 MHz = 6008
> >> > > > MHz. Am I right? Or is deviceQuery just unable to read this spec
> >> > > > properly on the Titan GPU?
> >> > > >
> >> > > > Anyway, for the moment I assume that the problem might be due to
> >> > > > the high shader/GPU frequency
> >> > > > (see here: http://folding.stanford.edu/English/DownloadUtils ).
> >> > > >
> >> > > > To verify this hypothesis one should perhaps UNDERclock to the
> >> > > > base frequency, which for this model is 876 MHz, or even to the
> >> > > > TITAN reference frequency, which is 837 MHz.
> >> > > >
> >> > > > Obviously I am working with these cards under Linux (CentOS,
> >> > > > kernel 2.6.32-358.6.1.el6.x86_64), and as far as I can tell, the
> >> > > > overclocking tools under Linux are in fact limited to the NVclock
> >> > > > utility, which is unfortunately out of date (at least as far as
> >> > > > the GTX Titan is concerned). I obtained this message when I
> >> > > > simply asked NVclock to read and print the shader and memory
> >> > > > frequencies of my Titans:
> >> > > >
> >> > > >
> >> > > > -------------------------------------------------------------------
> >> > > >
> >> > > > [root.dyn-138-272 NVCLOCK]# nvclock -s --speeds
> >> > > > Card: Unknown Nvidia card
> >> > > > Card number: 1
> >> > > > Memory clock: -2147483.750 MHz
> >> > > > GPU clock: -2147483.750 MHz
> >> > > >
> >> > > > Card: Unknown Nvidia card
> >> > > > Card number: 2
> >> > > > Memory clock: -2147483.750 MHz
> >> > > > GPU clock: -2147483.750 MHz
> >> > > >
> >> > > >
> >> > > >
> >> > > > -------------------------------------------------------------------
> >> > > >
> >> > > >
> >> > > > I would be really grateful for tips regarding NVclock
> >> > > > alternatives, but after wasting some hours on googling it seems
> >> > > > that there is no other Linux tool with NVclock's functionality.
> >> > > > So the only remaining possibility is perhaps to edit the GPU
> >> > > > BIOS with Lin/DOS/Win tools (Kepler BIOS Tweaker, NVflash), but
> >> > > > I would rather avoid such an approach, since using it probably
> >> > > > also voids the warranty, even if I am going to underclock the
> >> > > > GPUs, not overclock them.
> >> > > > So before this eventual step (GPU BIOS editing) I would like to
> >> > > > have some approximate estimate of the probability that the
> >> > > > problems really are caused by overclocking (a too-high default
> >> > > > boost shader frequency).
> >> > > >
> >> > > > I hope to estimate this probability from the responses of other
> >> > > > Amber / Titan SC users, if I am not the only crazy guy who bought
> >> > > > this model for Amber calculations :)) Any experiences with Titan
> >> > > > cards regarding their memtestG80 results and UNDER/OVERclocking
> >> > > > (if possible under Linux) are of course welcome as well!
> >> > > >
> >> > > > My HW/SW configuration
> >> > > >
> >> > > > motherboard: ASUS P9X79 PRO
> >> > > > CPU: Intel Core i7-3930K
> >> > > > RAM: CRUCIAL Ballistix Sport 32GB (4x8GB) DDR3 1600 VLP
> >> > > > CASE: CoolerMaster Dominator CM-690 II Advanced,
> >> > > > Power:Enermax PLATIMAX EPM1200EWT 1200W, 80+, Platinum
> >> > > > GPUs : 2 x EVGA GTX TITAN Superclocked 6GB
> >> > > > cooler: Cooler Master Hyper 412 SLIM
> >> > > >
> >> > > > OS: CentOS (2.6.32-358.6.1.el6.x86_64)
> >> > > > driver version: 319.17
> >> > > > cudatoolkit_5.0.35_linux_64_rhel6.x
> >> > > >
> >> > > > The computer is in an air-conditioned room with an ambient
> >> > > > temperature around 18°C.
> >> > > >
> >> > > >
> >> > > > Thanks a lot in advance for any comment/experience !
> >> > > >
> >> > > > Best wishes,
> >> > > >
> >> > > > Marek
> >> > > >
> >> > > > --
> >> > > > This message was created with Opera's revolutionary e-mail client:
> >> > > > http://www.opera.com/mail/
> >> > > >
> >
>
>
> --
> This message was created with Opera's revolutionary e-mail client:
> http://www.opera.com/mail/
>
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue May 28 2013 - 13:30:03 PDT