Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Marek Maly <marek.maly.ujep.cz>
Date: Tue, 28 May 2013 02:41:58 +0200

Hi Scott,

thanks for your response!

I have a very recent update on this story.

#1
Thanks to a valuable response in the EVGA forum (
http://www.evga.com/forums/tm.aspx?m=1940998 )
I finally learned that besides the "CLOSED" variants of memtestG80/memtestCL
available here: https://simtk.org/project/xml/downloads.xml?group_id=385
there are also "OPEN" variants, which seem to be more up to date and which
differ from the "CLOSED" variants in at least one respect: a synchronization
error in the random blocks test has been fixed :))
Here are the source links for the OPEN versions:

memtestG80
https://github.com/ihaque/memtestG80
here is the sync fix commit:
https://github.com/ihaque/memtestG80/commit/c4336a69fff07945c322d6c7fc40b0b12341cc4c

memtestCL
https://github.com/ihaque/memtestCL
and here is the fix commit:
https://github.com/ihaque/memtestCL/commit/a7f25002cde6dc396a09870ec8b468cd9e3bd5ff

When I tested my factory-OC TITANs with the patched (OPEN) version of
memtestG80, I obtained
0 errors!!!

So far (as of a few minutes ago) I have tested 5 GB of memory using
300 iterations
( ./memtestG80 -g 1 5000 300 ) with zero errors (on both GPUs).
This may already be enough, but I have no problem running more extensive
tests, and I could also try another Linux-based GPU testing tool if someone
can recommend one.
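
In case it helps, this is roughly how I plan to script a longer sweep over
most of the 6 GB on both cards - just a minimal sketch (bash), assuming the
patched (OPEN) memtestG80 binary sits in the current directory:

   # test ~5.5 GB (leaving some headroom for the driver) on each GPU,
   # 500 iterations, one log file per card
   for gpu in 0 1; do
       echo "=== GPU $gpu ==="
       ./memtestG80 -g "$gpu" 5500 500 | tee "memtestG80_gpu${gpu}.log"
   done

   # the tool prints a "Final error count ..." line at the end of a run
   grep "Final error count" memtestG80_gpu*.log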

BTW, it might be a good idea to use memtestG80 as a library, e.g. in the
Amber installation process (the CUDA configuration or CUDA test stage?), to
warn the Amber user in case "soft errors" are present on his "default" GPU.
The memtest authors explicitly mention such usage (see the links above); a
rough sketch of what I mean follows below. If this is not possible, I would
at least add the proper links to the Amber GPU page: http://ambermd.org/gpus
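
Something along these lines, purely as a sketch (bash) under my own
assumptions (memtestG80 on the PATH, a small 128 MiB / 50 iteration budget
- none of this comes from the actual Amber build system):

   # pre-flight soft-error check before running the CUDA test suite
   if command -v memtestG80 >/dev/null 2>&1; then
       if ! memtestG80 -g 0 128 50 | grep -q "Final error count.*: 0 errors"; then
           echo "WARNING: memtestG80 reported soft errors on GPU 0;" \
                "pmemd.cuda results on this card may be unreliable." >&2
       fi
   fi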


#2
For the moment, let us assume that both my factory-OC Titans are OK. So the
situation is that I have perfectly working GTX TITAN cards, just at a bit
"exotic" frequency of 928 MHz. (BTW, is there any tool besides deviceQuery
which can verify this working frequency, or even monitor the GPU frequency
in real time, e.g. when the GPU is under load? See the sketch below.)
Despite the fact that the GPUs are most likely OK, I experienced an error in
an Amber simulation of a ca. 100k-atom system (just protein in TIP3P water
with salt) after a few thousand steps, while a ca. 60k-atom system I was
able to simulate without any problems for 45 ns.
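
As a partial answer to my own monitoring question above: nvidia-smi can dump
its clock section, assuming the 319.x driver exposes it for GeForce boards
(on some GeForce cards these fields unfortunately just read N/A):

   # one-shot dump of the current graphics/SM and memory clocks
   nvidia-smi -q -d CLOCK

   # crude real-time view while pmemd.cuda runs, refreshed every second
   watch -n 1 nvidia-smi -q -d CLOCK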

Here is the input file used for both simulations:

  &cntrl
   imin=0,irest=1,ntx=5,
   nstlim=250000,dt=0.002,
   ntc=2,ntf=2,iwrap=0,ioutfm=1,
   cut=10.0, ntb=2, ntp=1, taup=2.0,
   ntpr=5000, ntwx=5000,
   ntt=3, gamma_ln=2.0,
   ig=-1,
   temp0=310.0,
  /
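
For reference, I launch the runs roughly along these lines (the file names
here are just placeholders of mine, not the actual ones):

   export CUDA_VISIBLE_DEVICES=0   # pin the job to one of the Titans
   $AMBERHOME/bin/pmemd.cuda -O -i md.in -p system.prmtop \
       -c restart.rst -o md.out -r md.rst -x md.nc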

I will try to do some testing, such as:

a) trying other systems with, say, 80-100k atoms
b) playing with the thermostat, i.e. the ntt setting
c) checking the impact of the ensemble type (NVE, NVT, NPT)
d) checking whether the 60k-atom system which had no problems gives the same
results (energy components) as on my other, older and well-tested
GTX 580/680/TESLA C2050 cards when using a fixed/constant ig value (is there
any recommendation regarding this value, or can I use an arbitrary seed?) -
would something like 1e6 steps (i.e. 2 ns with a 2 fs step) be OK for such a
test? (a rough sketch of this comparison follows below)
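
The comparison in (d) as I imagine it - a minimal sketch (bash) with
placeholder file names of my own; the key point is replacing ig=-1 with a
fixed seed so that both runs draw identical Langevin random numbers:

   # md_fixed.in = md.in above, with ig=-1 changed to a fixed value,
   # e.g. ig=71277 (Amber's historical default seed)
   $AMBERHOME/bin/pmemd.cuda -O -i md_fixed.in -p system.prmtop \
       -c restart.rst -o md_titan.out
   # ... run the identical command on the GTX 580/680/C2050 box to get
   # md_ref.out, then compare the energy components printed to mdout:
   grep -E "Etot|EKtot|EPtot" md_titan.out > e_titan.txt
   grep -E "Etot|EKtot|EPtot" md_ref.out   > e_ref.txt
   diff e_titan.txt e_ref.txt && echo "energy components match"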

Any other ideas about what else I could try to help debug/eliminate this
error? Could such a high clock frequency be the cause of a synchronization
error similar to the one memtest exposed?

The last, but certainly not my favourite, option would be to downclock to
the base 876 MHz, or even to the stock base frequency of 837 MHz. This
should perhaps solve the Amber issue, as I have not seen any reports of
failing TITANs on the mailing list recently. It also occurred to me that I
could first try installing the latest NVIDIA Linux driver ( 319.23 ; I
currently have 319.17 ), but I am rather skeptical about any improvement
regarding the Amber problems with my OC Titans.

As I mentioned above, I would be grateful for any support
(comments/suggestions/experiences)!

      Best wishes,

           Marek

On Tue, 28 May 2013 01:25:24 +0200, Scott Le Grand <varelse2005.gmail.com>
wrote:

> I have two GTX Titans. One is defective, the other is not. Unfortunately,
> they both pass all standard GPU memory tests.
>
> What the defective one doesn't do is generate reproducibly bit-accurate
> outputs for simulations of Factor IX (90,986 atoms) or larger, of 100K or
> so iterations.
>
> Which is yet another reason why I insist on MD algorithms (especially on
> GPUs) being deterministic. Besides its ability to find software bugs, and
> fulfilling one of the most important tenets of science, it's a great way to
> diagnose defective hardware with very little effort.
>
> 928 MHz? That's 6% above the boost clock of a stock Titan. Titan is
> pushing the performance envelope as is. If you're going to pay the premium
> for such chips, I'd send them back until you get one that runs correctly.
> I'm very curious how fast you can push one of these things before they give
> out.
>
> On Mon, May 27, 2013 at 10:01 AM, Marek Maly <marek.maly.ujep.cz> wrote:
>
>> Dear all,
>>
>> I have recently bought two "EVGA GTX TITAN Superclocked" GPUs.
>>
>> I did my first calculations (pmemd.cuda in Amber12) with systems of
>> around 60K atoms without any problems (NPT, Langevin), but when I later
>> tried bigger systems (around 100K atoms) I obtained the "classic"
>> irritating error
>>
>> cudaMemcpy GpuBuffer::Download failed unspecified launch failure
>>
>> after just a few thousand MD steps.
>>
>> This was obviously the reason to run the memtestG80 tests
>> ( https://simtk.org/home/memtest ).
>>
>> So I compiled memtestG80 from source ( memtestG80-1.1-src.tar.gz ) and
>> then tested just a small part of the GPU memory (200 MB) using 100
>> iterations.
>>
>> On both cards I obtained a huge number of errors, but "just" in
>> "Random blocks:" - 0 errors in all remaining tests in all iterations.
>>
>> ------THE LAST ITERATION AND FINAL RESULTS-------
>>
>> Test iteration 100 (GPU 0, 200 MiB): 169736847 errors so far
>> Moving Inversions (ones and zeros): 0 errors (6 ms)
>> Memtest86 Walking 8-bit: 0 errors (53 ms)
>> True Walking zeros (8-bit): 0 errors (26 ms)
>> True Walking ones (8-bit): 0 errors (26 ms)
>> Moving Inversions (random): 0 errors (6 ms)
>> Memtest86 Walking zeros (32-bit): 0 errors (105 ms)
>> Memtest86 Walking ones (32-bit): 0 errors (104 ms)
>> Random blocks: 1369863 errors (27 ms)
>> Memtest86 Modulo-20: 0 errors (215 ms)
>> Logic (one iteration): 0 errors (4 ms)
>> Logic (4 iterations): 0 errors (8 ms)
>> Logic (shared memory, one iteration): 0 errors (8 ms)
>> Logic (shared-memory, 4 iterations): 0 errors (25 ms)
>>
>> Final error count after 100 iterations over 200 MiB of GPU memory:
>> 171106710 errors
>>
>> ------------------------------------------
>>
>> I have some questions and would be really grateful for any comments.
>>
>> Regarding overclocking: using deviceQuery I found out that under Linux
>> both cards automatically run at the boost shader/GPU frequency, which
>> here is 928 MHz (the base value for these factory-OC cards is 876 MHz).
>> deviceQuery reported a Memory Clock rate of 3004 MHz although "it" should
>> be 6008 MHz, but maybe the quantity reported by deviceQuery as "Memory
>> Clock rate" is different from the product specification's "Memory Clock".
>> It seems that "Memory Clock rate" = "Memory Clock"/2. Am I right? Or is
>> deviceQuery just unable to read this spec properly on the Titan GPU?
>>
>> Anyway, for the moment I assume that the problem might be due to the
>> high shader/GPU frequency
>> (see here: http://folding.stanford.edu/English/DownloadUtils ).
>>
>> To verify this hypothesis one should perhaps UNDERclock to the base
>> frequency, which for this model is 876 MHz, or even to the TITAN
>> reference frequency, which is 837 MHz.
>>
>> Obviously I am working with these cards under Linux (CentOS
>> 2.6.32-358.6.1.el6.x86_64), and as far as I can tell, the OC tools under
>> Linux are in fact limited to the NVclock utility, which is unfortunately
>> out of date (at least as far as the GTX Titan is concerned). I obtained
>> the output below when I simply asked NVclock to read and print the shader
>> and memory frequencies of my Titans:
>>
>> -------------------------------------------------------------------
>>
>> [root.dyn-138-272 NVCLOCK]# nvclock -s --speeds
>> Card: Unknown Nvidia card
>> Card number: 1
>> Memory clock: -2147483.750 MHz
>> GPU clock: -2147483.750 MHz
>>
>> Card: Unknown Nvidia card
>> Card number: 2
>> Memory clock: -2147483.750 MHz
>> GPU clock: -2147483.750 MHz
>>
>>
>> -------------------------------------------------------------------
>>
>>
>> I would be really grateful for some tips regarding NVclock alternatives,
>> but after wasting some hours on Google it seems that there is no other
>> Linux tool with NVclock's functionality. So the only possibility here is
>> perhaps to edit the GPU BIOS with Linux/DOS/Windows tools like Kepler
>> BIOS Tweaker or NVflash, but obviously I would rather avoid such an
>> approach, as using it probably also voids the warranty, even though I am
>> going to underclock the GPUs, not overclock them.
>> So before this eventual step (GPU BIOS editing) I would like to have some
>> approximate estimate of the probability that the problems here are really
>> due to the overclocking (too high a default (boost) shader frequency).
>>
>> I hope to estimate this probability from the responses of other
>> Amber/Titan SC users, if I am not the only crazy guy who bought this
>> model for Amber calculations :)) But any experiences with Titan cards
>> regarding memtestG80 results and UNDER/OVERclocking (if possible under
>> Linux) are of course welcome as well!
>>
>> My HW/SW configuration
>>
>> motherboard: ASUS P9X79 PRO
>> CPU: Intel Core i7-3930K
>> RAM: CRUCIAL Ballistix Sport 32GB (4x8GB) DDR3 1600 VLP
>> CASE: CoolerMaster Dominator CM-690 II Advanced,
>> Power: Enermax PLATIMAX EPM1200EWT 1200W, 80+, Platinum
>> GPUs: 2 x EVGA GTX TITAN Superclocked 6GB
>> cooler: Cooler Master Hyper 412 SLIM
>>
>> OS: CentOS (2.6.32-358.6.1.el6.x86_64)
>> driver version: 319.17
>> cudatoolkit_5.0.35_linux_64_rhel6.x
>>
>> The computer is in an air-conditioned room with a constant ambient
>> temperature of around 18°C.
>>
>>
>> Thanks a lot in advance for any comment/experience !
>>
>> Best wishes,
>>
>> Marek
>>


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber