[AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ? from Marek Maly on 2013-05-27 (Amber Archive May 2013)

From: Marek Maly <marek.maly.ujep.cz>
Date: Mon, 27 May 2013 19:01:39 +0200

Dear all,

I have recently bought two "EVGA GTX TITAN Superclocked" GPUs.

My first calculations (pmemd.cuda in Amber12) on systems of around
60K atoms ran without any problems (NPT, Langevin), but when I later tried
bigger systems (around 100K atoms) I got the "classic" irritating
error

cudaMemcpy GpuBuffer::Download failed unspecified launch failure

after just a few thousand MD steps.

This prompted me to run the memtestG80 tests
( https://simtk.org/home/memtest ).

I compiled memtestG80 from source ( memtestG80-1.1-src.tar.gz ) and then
tested just a small part of the GPU memory (200 MiB) for 100 iterations.

On both cards I got a huge number of errors, but "only" in the
"Random blocks" test. All the remaining tests reported 0 errors in every
iteration.

------THE LAST ITERATION AND FINAL RESULTS-------

Test iteration 100 (GPU 0, 200 MiB): 169736847 errors so far
        Moving Inversions (ones and zeros): 0 errors (6 ms)
        Memtest86 Walking 8-bit: 0 errors (53 ms)
        True Walking zeros (8-bit): 0 errors (26 ms)
        True Walking ones (8-bit): 0 errors (26 ms)
        Moving Inversions (random): 0 errors (6 ms)
        Memtest86 Walking zeros (32-bit): 0 errors (105 ms)
        Memtest86 Walking ones (32-bit): 0 errors (104 ms)
        Random blocks: 1369863 errors (27 ms)
        Memtest86 Modulo-20: 0 errors (215 ms)
        Logic (one iteration): 0 errors (4 ms)
        Logic (4 iterations): 0 errors (8 ms)
        Logic (shared memory, one iteration): 0 errors (8 ms)
        Logic (shared-memory, 4 iterations): 0 errors (25 ms)

Final error count after 100 iterations over 200 MiB of GPU memory:
171106710 errors

------------------------------------------
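
The report above can be checked mechanically. The following is a minimal sketch, assuming the memtestG80 output keeps exactly the line layout shown in this post; the regular expressions and the summarize() helper are written for illustration and are not part of memtestG80.

```python
import re

# Excerpt copied from the report above (iteration 100 on GPU 0).
REPORT = """\
Test iteration 100 (GPU 0, 200 MiB): 169736847 errors so far
        Moving Inversions (ones and zeros): 0 errors (6 ms)
        Memtest86 Walking 8-bit: 0 errors (53 ms)
        True Walking zeros (8-bit): 0 errors (26 ms)
        True Walking ones (8-bit): 0 errors (26 ms)
        Moving Inversions (random): 0 errors (6 ms)
        Memtest86 Walking zeros (32-bit): 0 errors (105 ms)
        Memtest86 Walking ones (32-bit): 0 errors (104 ms)
        Random blocks: 1369863 errors (27 ms)
        Memtest86 Modulo-20: 0 errors (215 ms)
        Logic (one iteration): 0 errors (4 ms)
        Logic (4 iterations): 0 errors (8 ms)
        Logic (shared memory, one iteration): 0 errors (8 ms)
        Logic (shared-memory, 4 iterations): 0 errors (25 ms)

Final error count after 100 iterations over 200 MiB of GPU memory:
171106710 errors
"""

# Patterns assumed from the layout above (hypothetical, not from memtestG80).
HEADER = re.compile(
    r"Test iteration (\d+) \(GPU (\d+), (\d+) MiB\): (\d+) errors so far")
TEST = re.compile(r"^\s*(.+?): (\d+) errors \((\d+) ms\)\s*$")
FINAL = re.compile(
    r"Final error count after (\d+) iterations.*?:\s*(\d+) errors", re.S)


def summarize(report):
    """Collect per-test error counts and the running/final totals."""
    header = HEADER.search(report)
    final = FINAL.search(report)
    per_test = {}
    for line in report.splitlines():
        match = TEST.match(line)
        if match:
            per_test[match.group(1)] = int(match.group(2))
    return {
        "errors_before": int(header.group(4)),
        "per_test": per_test,
        "failing": [name for name, count in per_test.items() if count > 0],
        "final": int(final.group(2)),
    }


summary = summarize(REPORT)
print("failing tests:", summary["failing"])
# The running total plus this iteration's errors should equal the final count.
consistent = (summary["errors_before"] + sum(summary["per_test"].values())
              == summary["final"])
print("totals consistent:", consistent)
```

For the excerpt above this lists "Random blocks" as the only failing test and confirms that 169736847 + 1369863 = 171106710, so the final count is fully accounted for by that single test.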

I have some questions and would be really grateful for any comments.

Regarding overclocking: using deviceQuery I found that under Linux both
cards automatically run at the boost shader/GPU frequency, which here is
928 MHz (the base value for these factory-overclocked cards is 876 MHz).
deviceQuery reports a "Memory Clock rate" of 3004 MHz, although according
to the product specification it should be 6008 MHz. Perhaps the quantity
that deviceQuery reports as "Memory Clock rate" is different from the
"Memory Clock" of the product specification. It seems that
"Memory Clock rate" = "Memory Clock"/2. Am I right? Or is deviceQuery
simply unable to read this value properly on the Titan GPU?
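
The factor of two is at least consistent with how GDDR5 speeds are usually quoted: the product figure is an effective data rate, while deviceQuery appears to report the underlying double-data-rate clock. A minimal sketch of that arithmetic, assuming this interpretation holds for the Titan (the constant names are illustrative; the numbers are the two quoted above):

```python
# Values quoted in the post, in MHz.
REPORTED_MEM_CLOCK_MHZ = 3004  # "Memory Clock rate" printed by deviceQuery
SPEC_MEM_CLOCK_MHZ = 6008      # "Memory Clock" in the product specification

# Assumption: the specification counts two transfers per reported clock cycle.
TRANSFERS_PER_CLOCK = 2

effective_rate = REPORTED_MEM_CLOCK_MHZ * TRANSFERS_PER_CLOCK
matches_spec = effective_rate == SPEC_MEM_CLOCK_MHZ
print(effective_rate, matches_spec)
```

Under that assumption the two numbers describe the same memory speed, so the deviceQuery value would not by itself indicate a misread.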

Anyway, for the moment I assume that the problem might be due to the high
shader/GPU frequency
(see here: http://folding.stanford.edu/English/DownloadUtils ).

To verify this hypothesis one should perhaps UNDERclock to the base
frequency, which for this model is 876 MHz, or even to the TITAN REFERENCE
frequency, which is 837 MHz.
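
For scale, the size of each possible step down can be worked out from the three frequencies quoted above. A small sketch (the helper name percent_above is made up for illustration):

```python
# Frequencies quoted in the post, in MHz.
BOOST_MHZ = 928      # boost frequency reported by deviceQuery under Linux
BASE_MHZ = 876       # base frequency of the Superclocked model
REFERENCE_MHZ = 837  # TITAN reference frequency


def percent_above(freq_mhz, baseline_mhz):
    """Return how far freq_mhz sits above baseline_mhz, in percent."""
    return 100.0 * (freq_mhz - baseline_mhz) / baseline_mhz


boost_vs_base = percent_above(BOOST_MHZ, BASE_MHZ)
boost_vs_reference = percent_above(BOOST_MHZ, REFERENCE_MHZ)
base_vs_reference = percent_above(BASE_MHZ, REFERENCE_MHZ)

print(f"boost vs. base:      {boost_vs_base:.1f} %")
print(f"boost vs. reference: {boost_vs_reference:.1f} %")
print(f"base vs. reference:  {base_vs_reference:.1f} %")
```

So the cards currently run about 5.9 % above their own base frequency and about 10.9 % above the reference frequency.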

Obviously I am working with these cards under Linux (CentOS, kernel
2.6.32-358.6.1.el6.x86_64), and as far as I can tell the OC tools under
Linux are in fact limited to the NVclock utility, which is unfortunately
out of date (at least as far as the GTX Titan is concerned). I got this
output when I simply asked NVclock to read and print the shader and memory
frequencies of my Titans:

-------------------------------------------------------------------

[root.dyn-138-272 NVCLOCK]# nvclock -s --speeds
Card: Unknown Nvidia card
Card number: 1
Memory clock: -2147483.750 MHz
GPU clock: -2147483.750 MHz

Card: Unknown Nvidia card
Card number: 2
Memory clock: -2147483.750 MHz
GPU clock: -2147483.750 MHz

-------------------------------------------------------------------
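
The value -2147483.750 MHz looks like a placeholder rather than a real reading. One possible explanation, which is only a guess and has not been checked against the NVclock source, is that the tool holds the smallest 32-bit signed integer as a clock in kHz and converts it to MHz in single precision. The sketch below reproduces the printed number under exactly that assumption:

```python
import struct

INT32_MIN = -2**31  # -2147483648, assumed here to be an "unknown" marker in kHz


def to_float32(value):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


# Assumed conversion: kHz to MHz, stored as a 32-bit float.
clock_mhz = to_float32(INT32_MIN / 1000.0)
formatted = "%.3f MHz" % clock_mhz
print(formatted)  # -2147483.750 MHz
```

If that reading is right, the output says nothing about the actual clocks; it only shows that NVclock could not read them on an unrecognized card.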

I would be really grateful for tips on "NVclock alternatives", but after
some hours of googling it seems that there is no other Linux tool with
NVclock's functionality. So the only possibility here is perhaps to edit
the GPU BIOS with some Linux/DOS/Windows tools (such as Kepler BIOS
Tweaker or NVflash), but obviously I would rather avoid that approach,
since it probably also voids the warranty, even though I want to
underclock the GPUs, not overclock them.
So before taking this step (GPU BIOS editing) I would like a rough
estimate of the probability that the problems really are caused by the
overclocking (a default boost shader frequency that is too high).

I hope to estimate this probability from the responses of other
Amber/Titan SC users, if I am not the only crazy guy who bought this model
for Amber calculations :)) But any experience with Titan cards related to
their memtestG80 results and UNDER/OVERclocking (if possible under Linux)
is of course welcome as well!

My HW/SW configuration

motherboard: ASUS P9X79 PRO
CPU: Intel Core i7-3930K
RAM: CRUCIAL Ballistix Sport 32GB (4x8GB) DDR3 1600 VLP
CASE: CoolerMaster Dominator CM-690 II Advanced
Power: Enermax PLATIMAX EPM1200EWT 1200W, 80+ Platinum
GPUs: 2 x EVGA GTX TITAN Superclocked 6GB
cooler: Cooler Master Hyper 412 SLIM

OS: CentOS (2.6.32-358.6.1.el6.x86_64)
driver version: 319.17
cudatoolkit_5.0.35_linux_64_rhel6.x

The computer is in an air-conditioned room with a constant ambient
temperature of around 18°C.

   Thanks a lot in advance for any comments or experiences!

      Best wishes,

           Marek

-- 
This message was created with Opera's revolutionary e-mail client:
http://www.opera.com/mail/
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber

Received on Mon May 27 2013 - 10:30:02 PDT