Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Scott Le Grand <varelse2005.gmail.com>
Date: Wed, 29 May 2013 13:41:46 -0700

Your Titan setup is hosed. Your results were not 100% deterministic for
the same inputs.

Energies + Forces use a different subroutine than Forces alone, hence the
ntpr dependence: changing ntpr effectively changes the input.

It's 100% ironclad reproducibility that matters and you demonstrated it's
not happening.
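
A minimal sketch of the kind of bit-for-bit check being described here, assuming two mdout files (the names mdout.run1 and mdout.run2 are placeholders) produced by repeating the same benchmark with the same random seed; it compares every printed energy field as text, so any nondeterminism at the printed precision shows up as a mismatch:

    import re
    import sys

    # Placeholder names for two repeats of the same benchmark with the same seed.
    FILES = ("mdout.run1", "mdout.run2")

    # Matches "NAME = value" pairs in mdout energy blocks, e.g.
    # " Etot   =   -443256.6867  EKtot   = ...". Values are kept as strings so
    # the comparison is exact at the printed precision, not within a tolerance.
    PAIR = re.compile(r"([A-Za-z0-9\-() ]+?)\s*=\s*(-?\d+\.\d+)")

    def energy_records(path):
        with open(path) as fh:
            for line in fh:
                if line.lstrip().startswith("|"):
                    continue  # skip info/timing lines such as "| ns/day = ..."
                for name, value in PAIR.findall(line):
                    yield name.strip(), value

    def main():
        rec1 = list(energy_records(FILES[0]))
        rec2 = list(energy_records(FILES[1]))
        for i, (a, b) in enumerate(zip(rec1, rec2)):
            if a != b:
                print(f"First mismatch at record {i}: {a} vs {b}")
                return 1
        if len(rec1) != len(rec2):
            print(f"Different record counts ({len(rec1)} vs {len(rec2)}) - "
                  "one run probably stopped early.")
            return 1
        print(f"{len(rec1)} energy records compared - outputs are identical.")
        return 0

    if __name__ == "__main__":
        sys.exit(main())

Two healthy runs of identical input should report zero mismatches; a single differing record is enough to flag the kind of nondeterminism described above.
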
On May 29, 2013 1:30 PM, "Marek Maly" <marek.maly.ujep.cz> wrote:

> Hi all,
>
> First of all, thanks to Ross for his update! Although it is an open
> question whether it will solve all the reported Amber issues with
> Titan/OC Titan GPUs.
> So let's see and hope :))
>
> Here are my results - see the attached TXT file with tables summarising
> the test results. I ran the same Amber benchmark tests twice on each GPU
> (both Titans, the GTX 680 and the GTX 580) to check the reproducibility
> of the results after 100K steps with ig at its default (i.e. ig not
> present in the mdin file).
>
> The first table contains the ns/day estimates obtained for each molecular
> system on each TITAN GPU. Interestingly, the estimates obtained for the
> same system in different rounds differ slightly, but maybe that's OK.
>
> The second table lists the total energy after 100k steps, to check the
> reproducibility of the results.
>
> Here is a summary:
>
> #1 - simulation crashes on TITANs
>
> Interestingly, there was just one simulation crash in JAC_NPT (TITAN_0,
> ROUND_1); the remaining 3 TITAN JAC_NPT simulations finished. There were
> also 3 crashes in the CELLULOSE_NVE test, but the last simulation
> (TITAN_1, ROUND_2) finished without any problem. All the remaining
> simulations always finished without any problem. So the simulation
> crashes seem to be non-reproducible/unpredictable on some molecular
> systems/(mdin setups).
>
> CRASH ERRORS:
>
> a) JAC_NPT (TITAN_0, ROUND_1)
> Here 11k steps completed successfully before the crash; I found this
> error in the mdout file:
>
> | ERROR: max pairlist cutoff must be less than unit cell max sphere radius!
>
> b) CELLULOSE_NVE (TITAN_0, ROUND_1, ROUND_2; TITAN_1, ROUND_1)
> Here I did not find any error in the mdout file; only this error was
> written to standard output (screen/nohup.out file):
>
> ------
> Error: unspecified launch failure launching kernel kNLSkinTest
> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
> grep: mdinfo.1GTX_TITAN: No such file or directory
> -----
>
> in all three cases.
>
> In the CELLULOSE_NVE case I started to play with the NTPR parameter
> (originally just on the TITAN-0 GPU) to see how many steps completed
> before the crash, and then this little investigation became more
> interesting than I ever expected :)) Here, chronologically, are my
> results for E_tot after 2000 steps on different GPUs (machines) - I
> repeated the calculation several times for a given NTPR just to be sure.
>
> TITAN-0, Etot after 2000 steps
>
> NTPR=10
>
> -443256.6867
> -443256.6867
> -443256.6867
>
> NTPR=100
>
> -443250.1350
> -443250.1350
> -443250.1350
>
> NTPR=200
>
> -443261.0705
> -443261.0705
> -443072.3097
> -443261.0705
> -443261.0705
> -443261.0705
> -443261.0705
>
> NTPR=10 (again just to verify)
>
> -443256.6867
> -443256.6867
>
>
> Then I tried with TITAN-1
>
> NTPR=10
>
> -443256.6867
> -443256.6867
>
> NTPR=100
>
> -443250.1350
> -443250.1350
>
> NTPR=200
>
> -443261.0705
> -443261.0705
>
>
> Then I tried with GTX-580
>
> NTPR=10
>
> -443256.6867
> -443256.6867
>
> NTPR=200
>
> -443261.0705
> -443261.0705
>
> then I tried with GTX-680
>
> NTPR=10 Etot after 2000 steps
>
> -443256.6711
> -443256.6711
>
> NTPR=200 Etot after 2000 steps
>
> -443261.0705
> -443261.0705
>
> Any idea why the energies should depend on the frequency of the energy
> records (NTPR)?
>
>
>
> #2 - reproducibility on TITANs (see attached table.txt)
>
> Here, too, the differences depend on the particular systems/setups.
> For the FACTOR_IX_NVE, FACTOR_IX_NPT, TRPCAGE and MYOGLOBIN systems I
> obtained 100% reproducibility (the results for a given system were
> identical for both cards/all ROUNDs), while for the JAC_NVE, JAC_NPT and
> NUCLEOSOME systems I obtained small differences in general, although on
> the TITAN_1 GPU the NUCLEOSOME results were also 100% reproducible.
> Moreover, for the TITAN_1 card, which managed to finish the CELLULOSE
> test at least in ROUND_2, I did a 3rd additional round and got a result
> identical to that of ROUND_2 (i.e. -443246.3206), so regarding the
> TITAN_1 GPU I can say that it reproduces the 100k-step CELLULOSE_NVE
> result 100%, at least on those runs that finish successfully :))
>
>
> #3 - GTX-580, GTX-680 controls
>
> Here the simulations ran without any problems and were 100% reproducible
> on each card; however, the results for a given system differ slightly
> between the two cards, with the exception of the CELLULOSE system, where
> both the GTX-580 and the GTX-680 gave an identical result, which is
> moreover nearly identical to the result obtained with TITAN_1 during
> ROUND_2 (relative difference 2e-6).
>
>
> TO ET:
> a)
> I had no problems with the minimization stages in my own simulations of
> systems bigger than 100k atoms, which crashed during the NVT heating
> phase.
>
> b)
> The 313.30 driver??? OK, so after 319.23 I will try experimenting with
> this somewhat "outdated" version :)) At the moment I am working under
> 319.17 (and CUDA 5.0).
>
> c)
> Can you please run at least the JAC_NPT, JAC_NVE, NUCLEOSOME and
> CELLULOSE_NVE tests with 100,000 steps (same random seed, e.g. the
> default = ig deleted from the mdin if it is there) twice, to confirm 100%
> reproducibility on your TITAN GPU?
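
One way such a requested double run could be scripted is sketched below; pmemd.cuda is assumed to be on the PATH, prmtop/inpcrd/mdin are placeholders for the files of the benchmark being repeated, and the mdin must not set ig=-1 so that both runs use the same (default) random seed. Only the last Etot printed in each mdout is compared here, so the record-by-record comparison sketched earlier remains the stricter check.

    import re
    import subprocess

    # Placeholder input names for whichever benchmark is being repeated.
    # The mdin must not set ig=-1, so both runs use the same (default) seed.
    PRMTOP, INPCRD, MDIN = "prmtop", "inpcrd", "mdin"

    def run_once(tag):
        """Run pmemd.cuda once and return the last Etot value printed in mdout."""
        mdout = f"mdout.{tag}"
        subprocess.run(
            ["pmemd.cuda", "-O", "-i", MDIN, "-o", mdout,
             "-p", PRMTOP, "-c", INPCRD,
             "-r", f"restrt.{tag}", "-x", f"mdcrd.{tag}", "-inf", f"mdinfo.{tag}"],
            check=True)  # an abnormal termination raises CalledProcessError here
        etot = None
        with open(mdout) as fh:
            for line in fh:
                m = re.search(r"Etot\s*=\s*(-?\d+\.\d+)", line)
                if m:
                    etot = m.group(1)  # kept as text for an exact comparison
        return etot

    first, second = run_once("run1"), run_once("run2")
    print("identical" if first == second else f"MISMATCH: {first} vs {second}")
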
>
> TO Divi:
>
> This is also my usual approach, dividing the whole simulation into many
> sub-trajectories (in my case 0.5 ns = 250k 2 fs steps), but it does not
> seem to help here by itself. Can you please also run the same tests that
> I asked of ET (point c))?
>
>
> BTW, the CUDA 5.5 release candidate was just released
> (https://developer.nvidia.com/cuda-toolkit). Would it be a reasonable
> idea to try to compile/run pmemd.cuda with this brand new CUDA version?
>
> Thanks !
>
> Best wishes,
>
> Marek
>
>
>
>
>
>
> On Wed, 29 May 2013 03:44:33 +0200, Ross Walker <ross.rosswalker.co.uk>
> wrote:
>
> Hi All,
>>
>> Just an update that we will have some fixes out soon that address some
>> errors we have been noticing with simulations crashing during NPT runs. It
>> is possible that this is confusing the issue here as to whether the
>> problem is related to the GTX Titan or to a possible bug in the code. I
>> hope to have the patch released within a few days at which point it would
>> be good to repeat these tests and then hopefully we can try to track down
>> what is going on. I find it hard to believe that so many cards are faulty
>> so I suspect that there may be something funky in the code with regards to
>> GTX Titans. We'll try and get it fixed as soon as possible but for now
>> please just wait until we get the update released for AMBER 12 in a few
>> days and see if that helps at all.
>>
>> All the best
>> Ross
>>
>>
>> On 5/28/13 5:12 PM, "Divi/GMAIL" <dvenkatlu.gmail.com> wrote:
>>
>>> I have two TITANs in my Gigabyte workstation. I have had similar issues
>>> with NaNs for some of the simulation setups and never could figure out
>>> why the simulations failed for no apparent reason. I tried 10 and 12
>>> Angstrom box sizes; same random breakdowns. I thought of returning them,
>>> suspecting memory errors, but some simulations ran perfectly fine. I am
>>> currently running two calculations without any problems; both have been
>>> running stably for over 100 ns. I suspect the AMBER CUDA code may have
>>> some issues under some simulation conditions such as NPT. In general, an
>>> NVT setup is more successful than NPT in my case.
>>>
>>> One is a 287,426-atom simulation on one card (9 ns/day); on the other
>>> card, a 129,049-atom setup (20 ns/day).
>>>
>>> Both use the same NVT setup (AMBER12 / Intel 12.x compilers / CentOS 6.3
>>> / driver 319.17 / CUDA 5.0).
>>>
>>> Input is below:
>>> &cntrl
>>> nstlim=500000, dt=0.002,
>>> ntx=5, irest=1, ig=-1,
>>> ntpr=1000, ntwr=10000, ntwx=10000,
>>> ntt=1, tautp=2, ntb=1, ntp=0, ntc=2, ntf=2,
>>> iwrap=1, ioutfm=1, ntxo=2,
>>> &end
>>>
>>> One suggestion, if I may add: if you run short simulations of no more
>>> than 500,000 steps (or 1 ns at 2 fs), you might find some stability.
>>> Again, no scientific rationale on my side, but it worked in some cases
>>> for me.
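
A sketch of how such a chain of short segments might be driven; the names prmtop, heat.rst and mdin.segment are placeholders, mdin.segment is assumed to set irest=1, ntx=5 and the per-segment nstlim, and heat.rst is assumed to be an equilibrated restart file that already contains velocities.

    import subprocess

    # Chain of short restart segments (e.g. 250k steps / 0.5 ns each).
    N_SEGMENTS = 20
    prev_restart = "heat.rst"  # restart with velocities, matching irest=1/ntx=5

    for i in range(1, N_SEGMENTS + 1):
        tag = f"seg{i:03d}"
        subprocess.run(
            ["pmemd.cuda", "-O",
             "-i", "mdin.segment",
             "-p", "prmtop",
             "-c", prev_restart,        # continue from the previous segment
             "-o", f"{tag}.out",
             "-r", f"{tag}.rst",
             "-x", f"{tag}.nc",
             "-inf", f"{tag}.info"],
            check=True)                 # stop the chain if a segment crashes
        prev_restart = f"{tag}.rst"     # the next segment restarts from here
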
>>>
>>> This is a self-assembled system with a GIGABYTE GA-Z77X-UP7 motherboard
>>> (with a Core i5 processor), a 1200 W PSU and 16 GB of memory.
>>>
>>>
>>> Best regards
>>> Divi
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Scott Le Grand
>>> Sent: Tuesday, May 28, 2013 4:46 PM
>>> To: AMBER Mailing List
>>> Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
>>> memtestG80 - UNDERclocking in Linux ?
>>>
>>> You can play Russian Roulette a whole bunch of rounds without blowing
>>> your
>>> head off.
>>>
>>> Similarly, when you have a GPU that occasionally flips a bit the wrong
>>> way,
>>> most of the time it will be some low order perturbation to the
>>> coordinates
>>> that does little more than make the trajectory nondeterministic...
>>> Except
>>> when it doesn't...
>>>
>>> You can't even detect this kind of misbehavior in GROMACS, ACEMD, or NAMD
>>> because *none* of them (to my knowledge) are capable of producing
>>> deterministic output at production-level performance.
>>>
>>> Titans and 680s are consumer cards. I love them to death, but if you're
>>> going to do production work with them, you need to qual them thoroughly
>>> before proceeding or you need to pay up and use Teslas instead. I'd
>>> still
>>> build a cluster with Titans myself, but I'd ruthlessly RMA them until I
>>> got
>>> satisfaction if they couldn't pass a test consisting of running an AMBER
>>> simulation for 100K iterations without either crashing or producing a
>>> nondeterministic result. The customer is always right.
>>>
>>>
>>> On Tue, May 28, 2013 at 1:20 PM, Marek Maly <marek.maly.ujep.cz> wrote:
>>>
>>>> I would wait for the results of my GPU0, GPU1 double tests before
>>>> drawing any serious conclusions.
>>>>
>>>> BTW, what exactly does "GPU is hosed" mean? Something like the GPU is
>>>> damaged?
>>>>
>>>> Also, it would be strange (and not probable) to have bought 2 GPUs that
>>>> are both damaged (even in the same way).
>>>>
>>>> As I wrote, the memtestG80 tests were negative on both cards. If,
>>>> moreover, both cards perfectly reproduce both repetitions of the Amber
>>>> benchmarks and also pass some other GPU tests (can you recommend any
>>>> besides memtestG80?), I will still believe that the GPU cards are OK
>>>> (also thanks to the partial successes in my Amber simulations and the
>>>> current Amber benchmarks). So maybe I will eventually try downclocking,
>>>> but there might be other variables, e.g. driver, OS, motherboard (I will
>>>> perhaps test one card in another MB just to be sure the problem is not
>>>> MB based), etc. That's why I asked the user "ET" earlier for info about
>>>> his driver version; OS and MB info would also be interesting.
>>>>
>>>> M.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, 28 May 2013 22:13:36 +0200, Scott Le Grand
>>>> <varelse2005.gmail.com> wrote:
>>>>
>>>> > Marek,
>>>> > Your GPU is hosed. I don't have anything else to add. I'm not going
>>>> to
>>>> > go
>>>> > snark hunting for a bug that doesn't exist.
>>>> >
>>>> >
>>>> >
>>>> > On Tue, May 28, 2013 at 12:24 PM, Marek Maly <marek.maly.ujep.cz>
>>>> wrote:
>>>> >
>>>> >> Hi, just out of curiosity, which driver are you using on that machine
>>>> >> where the OC TITAN works perfectly: 319.17, or something more recent,
>>>> >> e.g. 319.23?
>>>> >>
>>>> >> RMA is a good idea, but it could also turn into a long story, and to
>>>> >> succeed you need strong arguments, especially if you are going to RMA
>>>> >> two OC TITANs.
>>>> >>
>>>> >> I am not sure whether my argument, "the cards have problems with some
>>>> >> Amber calculations", would be strong enough here. It would be much
>>>> >> better to have clear results from respected GPU tests, and as it seems,
>>>> >> you can run extensive GPU tests with multiple routines without any
>>>> >> errors and still have problems with particular Amber simulations...
>>>> >>
>>>> >> BTW, I am now running the Amber benchmarks with nstlim=100K and
>>>> >> ig=default twice for each card. The tests will be done in about 3 hours
>>>> >> (due to the slow nucleosome GB test).
>>>> >>
>>>> >> But even now I have interesting results from the first test on GPU0
>>>> >> (nucleosome is still running); see below.
>>>> >>
>>>> >> As you can see, JAC_NPT crashed around step 11000; here is the last
>>>> >> mdout record:
>>>> >>
>>>> >> *********
>>>> >>
>>>> >>
>>>> >> ---------------------------------------------------------------------------
>>>> >>
>>>> >> check COM velocity, temp: 0.000021 0.00(Removed)
>>>> >>
>>>> >> NSTEP =    11000   TIME(PS) =      28.000  TEMP(K) =   300.39  PRESS =    -9.4
>>>> >> Etot   =   -58092.8958  EKtot   =    14440.2520  EPtot      =   -72533.1478
>>>> >> BOND   =      443.3912  ANGLE   =     1253.5177  DIHED      =      970.1275
>>>> >> 1-4 NB =      567.2497  1-4 EEL =     6586.9007  VDWAALS    =     8664.9960
>>>> >> EELEC  =   -91019.3306  EHBOND  =        0.0000  RESTRAINT  =        0.0000
>>>> >> EKCMT  =     6274.0354  VIRIAL  =     6321.9969  VOLUME     =   236141.9494
>>>> >>                                                  Density    =        1.0162
>>>> >> ---------------------------------------------------------------------------
>>>> >>
>>>> >> | ERROR: max pairlist cutoff must be less than unit cell max sphere radius!
>>>> >>
>>>> >> ********
>>>> >> ********
>>>> >>
>>>> >> Any idea about that ERROR ?
>>>> >>
>>>> >> On the other hand, FACTOR_IX_NPT, which has many more atoms, passed
>>>> >> without any issue.
>>>> >>
>>>> >> Cellulose crashed right at the beginning, without any error message in
>>>> >> the mdout file.
>>>> >>
>>>> >>
>>>> >> I am very curious about the exact reproducibility of the results, at
>>>> >> least within the two repeat tests on each individual card.
>>>> >>
>>>> >> BTW, regarding possible downclocking, does anyone have an idea about an
>>>> >> NVclock alternative, or will I really be forced to edit the frequency
>>>> >> value in the GPU BIOS?
>>>> >>
>>>> >> Best,
>>>> >>
>>>> >> Marek
>>>> >>
>>>> >> HERE ARE THE FIRST DATA FROM MY 2x2 Bench tests
>>>> >>
>>>> >> JAC_PRODUCTION_NVE - 23,558 atoms PME
>>>> >> -------------------------------------
>>>> >> 1 x GTX_TITAN: | ns/day = 115.91   seconds/ns = 745.39
>>>> >>
>>>> >> JAC_PRODUCTION_NPT - 23,558 atoms PME
>>>> >> -------------------------------------
>>>> >> 1 x GTX_TITAN: STOP PMEMD Terminated Abnormally!
>>>> >> | ns/day = 90.72   seconds/ns = 952.42
>>>> >>
>>>> >> FACTOR_IX_PRODUCTION_NVE - 90,906 atoms PME
>>>> >> -------------------------------------------
>>>> >> 1 x GTX_TITAN: | ns/day = 30.56   seconds/ns = 2827.33
>>>> >>
>>>> >> FACTOR_IX_PRODUCTION_NPT - 90,906 atoms PME
>>>> >> -------------------------------------------
>>>> >> 1 x GTX_TITAN: | ns/day = 25.01   seconds/ns = 3454.56
>>>> >>
>>>> >> CELLULOSE_PRODUCTION_NVE - 408,609 atoms PME
>>>> >> --------------------------------------------
>>>> >> 1 x GTX_TITAN: Error: unspecified launch failure launching kernel kNLSkinTest
>>>> >> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
>>>> >> grep: mdinfo.1GTX_TITAN: No such file or directory
>>>> >>
>>>> >> TRPCAGE_PRODUCTION - 304 atoms GB
>>>> >> ---------------------------------
>>>> >> 1 x GTX_TITAN: | ns/day = 595.09   seconds/ns = 145.19
>>>> >>
>>>> >> MYOGLOBIN_PRODUCTION - 2,492 atoms GB
>>>> >> -------------------------------------
>>>> >> 1 x GTX_TITAN: | ns/day = 202.56   seconds/ns = 426.53
>>>> >>
>>>> >> NUCLEOSOME_PRODUCTION - 25,095 atoms GB
>>>> >> ---------------------------------------
>>>> >> 1 x GTX_TITAN:
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Tue, 28 May 2013 20:42:32 +0200, ET <sketchfoot.gmail.com> wrote:
>>>> >>
>>>> >> > Hi,
>>>> >> >
>>>> >> > I just got one superclocked Titan and one at the normal frequency. The
>>>> >> > first one ran like a charm, with no issues so far. The other,
>>>> >> > standard-clocked one could never get past the constant-pressure stage
>>>> >> > in an NPT simulation. It kept writing NaN or ********* in the outfile.
>>>> >> > I swapped them around in the PCIe lanes, then ran it solo in each of
>>>> >> > the lanes. Despite all this it was still failing the benchmark that the
>>>> >> > other one had no problems with.
>>>> >> >
>>>> >> > I couldn't find any memory errors with GPU-burn either, but as they
>>>> >> > cost near a grand apiece, I RMA'd it today. I recommend you do the same
>>>> >> > if it's not giving you any joy. Life's too short. :)
>>>> >> >
>>>> >> > br,
>>>> >> > g
>>>> >> >
>>>> >> >
>>>> >> > On 28 May 2013 16:57, Scott Le Grand <varelse2005.gmail.com> wrote:
>>>> >> >
>>>> >> >> AMBER != NAMD...
>>>> >> >>
>>>> >> >> GTX 680 != GTX Titan...
>>>> >> >>
>>>> >> >> Ian's suggestion is a good one. But even then, you need to test your
>>>> >> >> GPUs, as the Titans are running right on the edge of stability. Like I
>>>> >> >> told Marek, try running 100K iterations of Cellulose NVE twice with the
>>>> >> >> same random seed. If you don't get identical, bit-accurate output, your
>>>> >> >> GPU is not working. Memtest programs do not catch this because (I am
>>>> >> >> guessing) they are designed for a uniform memory hierarchy and only one
>>>> >> >> path to read and write data. I have a stock GTX Titan that cannot pass
>>>> >> >> the Cellulose NVE test and another one that does. I spent a couple of
>>>> >> >> days on the former GPU looking for the imaginary bug that went away
>>>> >> >> like magic the second I switched out the GPU.
>>>> >> >>
>>>> >> >> Scott
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> On Tue, May 28, 2013 at 8:11 AM, Robert Konecny <rok.ucsd.edu> wrote:
>>>> >> >>
>>>> >> >> > Hi Scott,
>>>> >> >> >
>>>> >> >> > unfortunately we are seeing similar Amber instability on GTX Titans as
>>>> >> >> > Marek is. We have a box with four GTX Titans (not overclocked) running
>>>> >> >> > CentOS 6.3 with the NVidia 319.17 driver and Amber 12.2. Any Amber
>>>> >> >> > simulation longer than 10-15 min eventually crashes on these cards,
>>>> >> >> > including both JAC benchmarks (with extended run time). This is
>>>> >> >> > reproducible on all four cards.
>>>> >> >> >
>>>> >> >> > To eliminate a possible hardware error we ran extended GPU memory
>>>> >> >> > tests on all four Titans with memtestG80, cuda_memtest and also
>>>> >> >> > gpu_burn - all finished without errors. Since I agree that these
>>>> >> >> > programs may not test the GPU completely, we also set up simulations
>>>> >> >> > with NAMD. We can run four NAMD simulations simultaneously for many
>>>> >> >> > days without any errors on this hardware. For reference - we also have
>>>> >> >> > exactly the same server with the same hardware components but with
>>>> >> >> > four GTX 680s, and that setup works just fine for Amber. So all this
>>>> >> >> > leads me to believe that a hardware error is not very likely.
>>>> >> >> >
>>>> >> >> > I would appreciate your comments on this; perhaps there is something
>>>> >> >> > else causing these errors which we are not seeing.
>>>> >> >> >
>>>> >> >> > Thanks,
>>>> >> >> >
>>>> >> >> > Robert
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > On Mon, May 27, 2013 at 04:25:24PM -0700, Scott Le Grand wrote:
>>>> >> >> > > I have two GTX Titans. One is defective, the other is not.
>>>> >> >> > > Unfortunately, they both pass all standard GPU memory tests.
>>>> >> >> > >
>>>> >> >> > > What the defective one doesn't do is generate reproducibly
>>>> >> >> > > bit-accurate outputs for simulations of Factor IX (90,986 atoms) or
>>>> >> >> > > larger, of 100K or so iterations.
>>>> >> >> > >
>>>> >> >> > > Which is yet another reason why I insist on MD algorithms (especially
>>>> >> >> > > on GPUs) being deterministic. Besides its ability to find software
>>>> >> >> > > bugs, and fulfilling one of the most important tenets of science,
>>>> >> >> > > it's a great way to diagnose defective hardware with very little
>>>> >> >> > > effort.
>>>> >> >> > >
>>>> >> >> > > 928 MHz? That's 6% above the boost clock of a stock Titan. Titan is
>>>> >> >> > > pushing the performance envelope as is. If you're going to pay the
>>>> >> >> > > premium for such chips, I'd send them back until you get one that
>>>> >> >> > > runs correctly. I'm very curious how fast you can push one of these
>>>> >> >> > > things before they give out.
>>>> >> >> > >
>>>> >> >> > >
>>>> >> >> > >
>>>> >> >> > >
>>>> >> >> > >
>>>> >> >> > >
>>>> >> >> > >
>>>> >> >> > > On Mon, May 27, 2013 at 10:01 AM, Marek Maly <marek.maly.ujep.cz> wrote:
>>>> >> >> > >
>>>> >> >> > > > Dear all,
>>>> >> >> > > >
>>>> >> >> > > > I have recently bought two "EVGA GTX TITAN Superclocked" GPUs.
>>>> >> >> > > >
>>>> >> >> > > > I did my first calculations (pmemd.cuda in Amber12) with systems of
>>>> >> >> > > > around 60K atoms without any problems (NPT, Langevin), but when I
>>>> >> >> > > > later tried bigger systems (around 100K atoms) I obtained the
>>>> >> >> > > > "classic" irritating errors
>>>> >> >> > > >
>>>> >> >> > > > cudaMemcpy GpuBuffer::Download failed unspecified launch failure
>>>> >> >> > > >
>>>> >> >> > > > just after a few thousand MD steps.
>>>> >> >> > > >
>>>> >> >> > > > This was obviously the reason for the memtestG80 tests
>>>> >> >> > > > ( https://simtk.org/home/memtest ).
>>>> >> >> > > >
>>>> >> >> > > > So I compiled memtestG80 from source ( memtestG80-1.1-src.tar.gz )
>>>> >> >> > > > and then tested just a small part of the GPU memory (200 MB) using
>>>> >> >> > > > 100 iterations.
>>>> >> >> > > >
>>>> >> >> > > > On both cards I obtained a huge number of errors, but "just" in the
>>>> >> >> > > > "Random blocks:" test - 0 errors in all the remaining tests in all
>>>> >> >> > > > iterations.
>>>> >> >> > > >
>>>> >> >> > > > ------THE LAST ITERATION AND FINAL RESULTS-------
>>>> >> >> > > >
>>>> >> >> > > > Test iteration 100 (GPU 0, 200 MiB): 169736847 errors so far
>>>> >> >> > > > Moving Inversions (ones and zeros): 0 errors (6 ms)
>>>> >> >> > > > Memtest86 Walking 8-bit: 0 errors (53 ms)
>>>> >> >> > > > True Walking zeros (8-bit): 0 errors (26 ms)
>>>> >> >> > > > True Walking ones (8-bit): 0 errors (26 ms)
>>>> >> >> > > > Moving Inversions (random): 0 errors (6 ms)
>>>> >> >> > > > Memtest86 Walking zeros (32-bit): 0 errors (105 ms)
>>>> >> >> > > > Memtest86 Walking ones (32-bit): 0 errors (104 ms)
>>>> >> >> > > > Random blocks: 1369863 errors (27 ms)
>>>> >> >> > > > Memtest86 Modulo-20: 0 errors (215 ms)
>>>> >> >> > > > Logic (one iteration): 0 errors (4 ms)
>>>> >> >> > > > Logic (4 iterations): 0 errors (8 ms)
>>>> >> >> > > > Logic (shared memory, one iteration): 0 errors (8 ms)
>>>> >> >> > > > Logic (shared-memory, 4 iterations): 0 errors (25 ms)
>>>> >> >> > > >
>>>> >> >> > > > Final error count after 100 iterations over 200 MiB of GPU memory:
>>>> >> >> > > > 171106710 errors
>>>> >> >> > > >
>>>> >> >> > > > --------------------------------------------
>>>> >> >> > > >
>>>> >> >> > > > I have some questions and would be really grateful for any
>>>> >> >> comments.
>>>> >> >> > > >
>>>> >> >> > > > Regarding overclocking: using deviceQuery I found out that under
>>>> >> >> > > > Linux both cards automatically run at the boost shader/GPU frequency,
>>>> >> >> > > > which here is 928 MHz (the base value for these factory-OC cards is
>>>> >> >> > > > 876 MHz). deviceQuery reported a Memory Clock rate of 3004 MHz,
>>>> >> >> > > > although "it" should be 6008 MHz, but maybe the quantity reported by
>>>> >> >> > > > deviceQuery as "Memory Clock rate" is different from the product
>>>> >> >> > > > specification's "Memory Clock". It seems that "Memory Clock rate" =
>>>> >> >> > > > "Memory Clock"/2. Am I right? Or is deviceQuery just unable to read
>>>> >> >> > > > this spec properly on a Titan GPU?
>>>> >> >> > > >
>>>> >> >> > > > Anyway, for the moment I assume that the problem might be due to the
>>>> >> >> > > > high shader/GPU frequency
>>>> >> >> > > > (see here: http://folding.stanford.edu/English/DownloadUtils ).
>>>> >> >> > > >
>>>> >> >> > > > To verify this hypothesis one should perhaps UNDERclock to the base
>>>> >> >> > > > frequency, which for this model is 876 MHz, or even to the TITAN
>>>> >> >> > > > reference frequency, which is 837 MHz.
>>>> >> >> > > >
>>>> >> >> > > > Obviously I am working with these cards under Linux (CentOS,
>>>> >> >> > > > 2.6.32-358.6.1.el6.x86_64), and as far as I can tell the OC tools
>>>> >> >> > > > under Linux are in fact limited to the NVclock utility, which is
>>>> >> >> > > > unfortunately out of date (at least as far as the GTX Titan is
>>>> >> >> > > > concerned). I obtained this message when I simply asked the NVclock
>>>> >> >> > > > utility to read and print the shader and memory frequencies of my
>>>> >> >> > > > Titans:
>>>> >> >> > > >
>>>> >> >> > > >
>>>> >> >> > > > ---------------------------------------------------------------------
>>>> >> >> > > >
>>>> >> >> > > > [root.dyn-138-272 NVCLOCK]# nvclock -s --speeds
>>>> >> >> > > > Card: Unknown Nvidia card
>>>> >> >> > > > Card number: 1
>>>> >> >> > > > Memory clock: -2147483.750 MHz
>>>> >> >> > > > GPU clock: -2147483.750 MHz
>>>> >> >> > > >
>>>> >> >> > > > Card: Unknown Nvidia card
>>>> >> >> > > > Card number: 2
>>>> >> >> > > > Memory clock: -2147483.750 MHz
>>>> >> >> > > > GPU clock: -2147483.750 MHz
>>>> >> >> > > >
>>>> >> >> > > >
>>>> >> >> > > >
>>>> >> >> > > > ---------------------------------------------------------------------
>>>> >> >> > > >
>>>> >> >> > > >
>>>> >> >> > > > I would be really grateful for some tips regarding "NVclock
>>>> >> >> > > > alternatives", but after wasting some hours on googling it seems that
>>>> >> >> > > > there is no other Linux tool with NVclock's functionality. So the
>>>> >> >> > > > only possibility here is perhaps to edit the GPU BIOS with some
>>>> >> >> > > > Linux/DOS/Windows tools (Kepler BIOS Tweaker, NVflash), but I would
>>>> >> >> > > > rather avoid such an approach, as using it probably also voids the
>>>> >> >> > > > warranty, even though I am going to underclock the GPUs, not
>>>> >> >> > > > overclock them.
>>>> >> >> > > > So before this possible step (GPU BIOS editing) I would like to have
>>>> >> >> > > > some approximate estimate of the probability that the problems here
>>>> >> >> > > > really are caused by the overclocking (a too-high default boost
>>>> >> >> > > > shader frequency).
>>>> >> >> > > >
>>>> >> >> > > > I hope to estimate this probability from the responses of other
>>>> >> >> > > > Amber/Titan SC users, if I am not the only crazy guy who bought this
>>>> >> >> > > > model for Amber calculations :)) But of course any experiences with
>>>> >> >> > > > Titan cards related to their memtestG80 results and
>>>> >> >> > > > UNDER/OVERclocking (if possible under Linux) are welcome as well!
>>>> >> >> > > >
>>>> >> >> > > > My HW/SW configuration
>>>> >> >> > > >
>>>> >> >> > > > motherboard: ASUS P9X79 PRO
>>>> >> >> > > > CPU: Intel Core i7-3930K
>>>> >> >> > > > RAM: CRUCIAL Ballistix Sport 32GB (4x8GB) DDR3 1600 VLP
>>>> >> >> > > > CASE: CoolerMaster Dominator CM-690 II Advanced,
>>>> >> >> > > > Power:Enermax PLATIMAX EPM1200EWT 1200W, 80+, Platinum
>>>> >> >> > > > GPUs : 2 x EVGA GTX TITAN Superclocked 6GB
>>>> >> >> > > > cooler: Cooler Master Hyper 412 SLIM
>>>> >> >> > > >
>>>> >> >> > > > OS: CentOS (2.6.32-358.6.1.el6.x86_64)
>>>> >> >> > > > driver version: 319.17
>>>> >> >> > > > cudatoolkit_5.0.35_linux_64_rhel6.x
>>>> >> >> > > >
>>>> >> >> > > > The computer is in an air-conditioned room with a constant ambient
>>>> >> >> > > > temperature of around 18°C.
>>>> >> >> > > >
>>>> >> >> > > >
>>>> >> >> > > > Thanks a lot in advance for any comment/experience !
>>>> >> >> > > >
>>>> >> >> > > > Best wishes,
>>>> >> >> > > >
>>>> >> >> > > > Marek
>>>> >> >> > > >
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed May 29 2013 - 14:00:03 PDT