Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ? from Marek Maly on 2013-05-28 (Amber Archive May 2013)

From: Marek Maly <marek.maly.ujep.cz>
Date: Tue, 28 May 2013 21:24:36 +0200

Hi, just out of curiosity: which driver are you using on the machine
where the OC TITAN works perfectly, 319.17 or something newer,
e.g. 319.23?

An RMA is a good idea, but it could turn into a long story, and
to succeed you need strong arguments,
especially if you are going to RMA two OC TITANs.

I am not sure that my argument, "the cards have problems with some Amber
calculations", would be strong enough here. It would be much better to have
clear results from respected GPU tests, but it seems that one can run
extensive GPU tests with multiple routines without any errors and still
have problems with particular Amber simulations...

BTW, I am now running the Amber benchmarks with nstlim=100K and ig=default,
twice for each card. The tests will be done in about 3 hours (because of the
slow nucleosome GB test).

But even now I have interesting results from the first test on GPU0
(nucleosome is still running); see below.

As you can see, JAC_NPT crashed around step 11000. Here is the last md.out
record:

*********
  ------------------------------------------------------------------------------

check COM velocity, temp: 0.000021 0.00(Removed)

  NSTEP =    11000   TIME(PS) =      28.000  TEMP(K) =   300.39  PRESS =    -9.4
  Etot   =    -58092.8958  EKtot   =     14440.2520  EPtot      =    -72533.1478
  BOND   =       443.3912  ANGLE   =      1253.5177  DIHED      =       970.1275
  1-4 NB =       567.2497  1-4 EEL =      6586.9007  VDWAALS    =      8664.9960
  EELEC  =    -91019.3306  EHBOND  =         0.0000  RESTRAINT  =         0.0000
  EKCMT  =      6274.0354  VIRIAL  =      6321.9969  VOLUME     =    236141.9494
                                                     Density    =         1.0162
  ------------------------------------------------------------------------------

| ERROR: max pairlist cutoff must be less than unit cell max sphere radius!

********

Any idea about that ERROR?

On the other hand, FACTOR_IX_NPT, which has many more atoms, passed without
any issue.

Cellulose crashed at the beginning without any ERROR message in the md.out
file.

I am very curious about the exact reproducibility of the results, at least
between the two tests on each individual card.
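
For that reproducibility comparison, here is a minimal Python sketch (an
illustration only, not part of Amber). It assumes the two runs of one
benchmark wrote ordinary md.out files, treats every block from an "NSTEP ="
line to the next dashed separator as one energy record, and reports the first
step at which the printed records differ. The names energy_records and
first_divergence, and the file names in the note below, are hypothetical.

```python
import re
from typing import Dict, Optional

# An energy record is assumed to start at a line containing "NSTEP = <n>"
# and to end at the next dashed separator line.
NSTEP = re.compile(r"NSTEP\s*=\s*(\d+)")


def energy_records(text: str) -> Dict[int, str]:
    """Map each step number to its whitespace-normalized energy record."""
    records: Dict[int, str] = {}
    step = None
    lines = []
    for line in text.splitlines():
        match = NSTEP.search(line)
        if match:
            step = int(match.group(1))
            lines = []
        if step is None:
            continue
        if line.strip().startswith("-----"):
            # Keep only the first record seen for a step, so a later
            # summary block reusing the same step number is ignored.
            records.setdefault(step, " ".join(" ".join(lines).split()))
            step = None
        else:
            lines.append(line)
    return records


def first_divergence(out_a: str, out_b: str) -> Optional[int]:
    """Return the first step whose printed record differs, or None."""
    a, b = energy_records(out_a), energy_records(out_b)
    for step in sorted(set(a) | set(b)):
        # A step missing from one run (e.g. after a crash) also counts.
        if a.get(step) != b.get(step):
            return step
    return None
```

For example, first_divergence(open("mdout.run1").read(),
open("mdout.run2").read()) returns None when every printed record agrees.
Agreement of the printed energies is only as strict as the printed precision,
so comparing the restart files as well would be a stronger check.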

BTW, regarding possible downclocking: does anyone know of an NVclock
alternative, or will I really be forced to edit the frequency value
in the GPU BIOS?

     Best,

        Marek

HERE ARE THE FIRST DATA FROM MY 2x2 Bench tests

JAC_PRODUCTION_NVE - 23,558 atoms PME
-------------------------------------

        1 x GTX_TITAN: | ns/day = 115.91 seconds/ns = 745.39

JAC_PRODUCTION_NPT - 23,558 atoms PME
-------------------------------------

        1 x GTX_TITAN: STOP PMEMD Terminated Abnormally!
| ns/day = 90.72 seconds/ns = 952.42

FACTOR_IX_PRODUCTION_NVE - 90,906 atoms PME
-------------------------------------------

        1 x GTX_TITAN: | ns/day = 30.56 seconds/ns = 2827.33

FACTOR_IX_PRODUCTION_NPT - 90,906 atoms PME
-------------------------------------------

        1 x GTX_TITAN: | ns/day = 25.01 seconds/ns = 3454.56

CELLULOSE_PRODUCTION_NVE - 408,609 atoms PME
--------------------------------------------

        1 x GTX_TITAN: Error: unspecified launch failure launching kernel kNLSkinTest
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
grep: mdinfo.1GTX_TITAN: No such file or directory

TRPCAGE_PRODUCTION - 304 atoms GB
---------------------------------
        1 x GTX_TITAN: | ns/day = 595.09 seconds/ns = 145.19

MYOGLOBIN_PRODUCTION - 2,492 atoms GB
-------------------------------------

        1 x GTX_TITAN: | ns/day = 202.56 seconds/ns = 426.53

NUCLEOSOME_PRODUCTION - 25,095 atoms GB
---------------------------------------

        1 x GTX_TITAN:

On Tue, 28 May 2013 20:42:32 +0200, ET <sketchfoot.gmail.com> wrote:

> Hi,
>
> I just got a superclocked Titan and one at normal freq. The first one ran
> like a charm with no issues so far. The other, standard-clocked one could
> never get past the constant pressure stage in an NPT simulation. It kept
> writing NaN or ********* in the outfile. I swapped them about in the PCIe
> lanes, then ran it solo in each one of the lanes. Despite all this, it was
> still failing the benchmark that the other one had no problems with.
>
> I couldn't find any memory errors with GPU-burn either, but as they cost
> near a grand apiece, I RMA'd it today. I recommend you do the same if
> it's not giving you any joy. Life's too short. :)
>
> br,
> g
>
>
> On 28 May 2013 16:57, Scott Le Grand <varelse2005.gmail.com> wrote:
>
>> AMBER != NAMD...
>>
>> GTX 680 != GTX Titan...
>>
>> Ian's suggestion is a good one. But even then, you need to test your GPUs,
>> as the Titans are running right on the edge of stability. Like I told
>> Marek, try running 100K iterations of Cellulose NVE twice with the same
>> random seed. If you don't get identical, bit-accurate output, your GPU is
>> not working. Memtest programs do not catch this because (I am guessing)
>> they are designed for a uniform memory hierarchy and only one path to read
>> and write data. I have a stock GTX Titan that cannot pass the Cellulose
>> NVE test and another one that does. I spent a couple of days on the former
>> GPU looking for the imaginary bug that went away like magic the second I
>> switched out the GPU.
>>
>> Scott
>>
>>
>>
>>
>>
>> On Tue, May 28, 2013 at 8:11 AM, Robert Konecny <rok.ucsd.edu> wrote:
>>
>> > Hi Scott,
>> >
>> > unfortunately we are seeing similar Amber instability on GTX Titans as
>> > Marek is. We have a box with four GTX Titans (not overclocked) running
>> > CentOS 6.3 with NVidia 319.17 driver and Amber 12.2. Any Amber simulation
>> > longer than 10-15 min eventually crashes on these cards, including both
>> > JAC benchmarks (with extended run time). This is reproducible on all four
>> > cards.
>> >
>> > To eliminate a possible hardware error, we ran extended GPU memory tests
>> > on all four Titans with memtestG80, cuda_memtest and also gpu_burn; all
>> > finished without errors. Since I agree that these programs may not test
>> > the GPU completely, we also set up simulations with NAMD. We can run four
>> > NAMD simulations simultaneously for many days without any errors on this
>> > hardware. For reference, we also have exactly the same server with the
>> > same hardware components but with four GTX680s, and this setup works just
>> > fine for Amber. So all this leads me to believe that a hardware error is
>> > not very likely.
>> >
>> > I would appreciate your comments on this; perhaps there is something else
>> > causing these errors which we are not seeing.
>> >
>> > Thanks,
>> >
>> > Robert
>> >
>> >
>> > On Mon, May 27, 2013 at 04:25:24PM -0700, Scott Le Grand wrote:
>> > > I have two GTX Titans. One is defective, the other is not. Unfortunately,
>> > > they both pass all standard GPU memory tests.
>> > >
>> > > What the defective one doesn't do is generate reproducibly bit-accurate
>> > > outputs for simulations of Factor IX (90,906 atoms) or larger, of 100K or
>> > > so iterations.
>> > >
>> > > Which is yet another reason why I insist on MD algorithms (especially on
>> > > GPUs) being deterministic. Besides its ability to find software bugs and
>> > > fulfilling one of the most important tenets of science, it's a great way to
>> > > diagnose defective hardware with very little effort.
>> > >
>> > > 928 MHz? That's 6% above the boost clock of a stock Titan. Titan is
>> > > pushing the performance envelope as is. If you're going to pay the premium
>> > > for such chips, I'd send them back until you get one that runs correctly.
>> > > I'm very curious how fast you can push one of these things before they give
>> > > out.
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Mon, May 27, 2013 at 10:01 AM, Marek Maly <marek.maly.ujep.cz> wrote:
>> > >
>> > > > Dear all,
>> > > >
>> > > > I have recently bought two "EVGA GTX TITAN Superclocked" GPUs.
>> > > >
>> > > > My first calculations (pmemd.cuda in Amber12) with systems of around
>> > > > 60K atoms ran without any problems (NPT, Langevin), but when I later
>> > > > tried bigger systems (around 100K atoms) I got the "classical"
>> > > > irritating error
>> > > >
>> > > > cudaMemcpy GpuBuffer::Download failed unspecified launch failure
>> > > >
>> > > > after just a few thousand MD steps.
>> > > >
>> > > > This was the reason for the memtestG80 tests
>> > > > ( https://simtk.org/home/memtest ).
>> > > >
>> > > > I compiled memtestG80 from source ( memtestG80-1.1-src.tar.gz ) and
>> > > > then tested just a small part of the GPU memory (200 MB) using 100
>> > > > iterations.
>> > > >
>> > > > On both cards I got a huge number of errors, but "just" in
>> > > > "Random blocks:". There were 0 errors in all remaining tests in all
>> > > > iterations.
>> > > >
>> > > > ------THE LAST ITERATION AND FINAL RESULTS-------
>> > > >
>> > > > Test iteration 100 (GPU 0, 200 MiB): 169736847 errors so far
>> > > > Moving Inversions (ones and zeros): 0 errors (6 ms)
>> > > > Memtest86 Walking 8-bit: 0 errors (53 ms)
>> > > > True Walking zeros (8-bit): 0 errors (26 ms)
>> > > > True Walking ones (8-bit): 0 errors (26 ms)
>> > > > Moving Inversions (random): 0 errors (6 ms)
>> > > > Memtest86 Walking zeros (32-bit): 0 errors (105 ms)
>> > > > Memtest86 Walking ones (32-bit): 0 errors (104 ms)
>> > > > Random blocks: 1369863 errors (27 ms)
>> > > > Memtest86 Modulo-20: 0 errors (215 ms)
>> > > > Logic (one iteration): 0 errors (4 ms)
>> > > > Logic (4 iterations): 0 errors (8 ms)
>> > > > Logic (shared memory, one iteration): 0 errors (8 ms)
>> > > > Logic (shared-memory, 4 iterations): 0 errors (25 ms)
>> > > >
>> > > > Final error count after 100 iterations over 200 MiB of GPU memory: 171106710 errors
>> > > >
>> > > > ------------------------------------------
>> > > >
>> > > > I have some questions and would be really grateful for any comments.
>> > > >
>> > > > Regarding overclocking: using deviceQuery I found out that under Linux
>> > > > both cards automatically run at the boost shader/GPU frequency, which
>> > > > is 928 MHz here (the base value for these factory-OC cards is 876 MHz).
>> > > > deviceQuery reports a "Memory Clock rate" of 3004 MHz although it
>> > > > should be 6008 MHz, but maybe the quantity reported by deviceQuery as
>> > > > "Memory Clock rate" is different from the product specification
>> > > > "Memory Clock". It seems that "Memory Clock rate" = "Memory Clock"/2.
>> > > > Am I right? Or is deviceQuery just not able to read this spec properly
>> > > > on the Titan GPU?
>> > > >
>> > > > Anyway, for the moment I assume that the problem might be due to the
>> > > > high shader/GPU frequency
>> > > > (see here: http://folding.stanford.edu/English/DownloadUtils ).
>> > > >
>> > > > To verify this hypothesis one should perhaps UNDERclock to the base
>> > > > frequency, which for this model is 876 MHz, or even to the TITAN
>> > > > REFERENCE frequency, which is 837 MHz.
>> > > >
>> > > > I am working with these cards under Linux (CentOS
>> > > > 2.6.32-358.6.1.el6.x86_64), and as far as I can tell the OC tools under
>> > > > Linux are in fact limited to the NVclock utility, which is unfortunately
>> > > > out of date (at least as far as the GTX Titan is concerned). I got this
>> > > > message when I just wanted NVclock to read and print the shader and
>> > > > memory frequencies of my Titans:
>> > > >
>> > > > -------------------------------------------------------------------
>> > > >
>> > > > [root.dyn-138-272 NVCLOCK]# nvclock -s --speeds
>> > > > Card: Unknown Nvidia card
>> > > > Card number: 1
>> > > > Memory clock: -2147483.750 MHz
>> > > > GPU clock: -2147483.750 MHz
>> > > >
>> > > > Card: Unknown Nvidia card
>> > > > Card number: 2
>> > > > Memory clock: -2147483.750 MHz
>> > > > GPU clock: -2147483.750 MHz
>> > > >
>> > > >
>> > > > -------------------------------------------------------------------
>> > > >
>> > > > I would be really grateful for tips on "NVclock alternatives", but
>> > > > after wasting some hours googling it seems that there is no other Linux
>> > > > tool with NVclock functionality. So the only possibility here is perhaps
>> > > > to edit the GPU BIOS with Lin/DOS/Win tools such as Kepler BIOS
>> > > > Tweaker or NVflash, but I would rather avoid this approach, as it
>> > > > probably also voids the warranty even if I underclock the GPUs rather
>> > > > than overclock them. So before this possible step (GPU BIOS editing) I
>> > > > would like to have a rough estimate of the probability that the
>> > > > problems are really caused by the overclocking (too high a default
>> > > > (boost) shader frequency).
>> > > >
>> > > > I hope to estimate this probability from the responses of other
>> > > > Amber/Titan SC users, if I am not the only crazy guy who bought this
>> > > > model for Amber calculations :)) But any experiences with Titan cards
>> > > > related to their memtestG80 results and UNDER/OVERclocking (if possible
>> > > > under Linux) are of course welcome as well!
>> > > >
>> > > > My HW/SW configuration
>> > > >
>> > > > motherboard: ASUS P9X79 PRO
>> > > > CPU: Intel Core i7-3930K
>> > > > RAM: CRUCIAL Ballistix Sport 32GB (4x8GB) DDR3 1600 VLP
>> > > > CASE: CoolerMaster Dominator CM-690 II Advanced,
>> > > > Power:Enermax PLATIMAX EPM1200EWT 1200W, 80+, Platinum
>> > > > GPUs : 2 x EVGA GTX TITAN Superclocked 6GB
>> > > > cooler: Cooler Master Hyper 412 SLIM
>> > > >
>> > > > OS: CentOS (2.6.32-358.6.1.el6.x86_64)
>> > > > driver version: 319.17
>> > > > cudatoolkit_5.0.35_linux_64_rhel6.x
>> > > >
>> > > > The computer is in air-conditioned room with permanent external
>> > > > temperature around 18°C
>> > > >
>> > > >
>> > > > Thanks a lot in advance for any comment/experience !
>> > > >
>> > > > Best wishes,
>> > > >
>> > > > Marek
>> > > >
>> > > > --
>> > > > This message was created with Opera's revolutionary e-mail client:
>> > > > http://www.opera.com/mail/
>> > > >
>> > > > _______________________________________________
>> > > > AMBER mailing list
>> > > > AMBER.ambermd.org
>> > > > http://lists.ambermd.org/mailman/listinfo/amber
>> > > >

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber

Received on Tue May 28 2013 - 13:00:04 PDT