Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Marek Maly <marek.maly.ujep.cz>
Date: Wed, 29 May 2013 22:11:12 +0200

Hi all,

First of all, thanks to Ross for his update, although it remains an open
question whether it will solve all the reported Amber issues with
Titan/OC Titan GPUs.
So let's see and hope :))

Here are my results. See the attached TXT file with tables in which
the test results are summarised. I ran the same Amber benchmark tests
twice on each GPU (both Titans, GTX 680 and GTX 580) to check the
reproducibility of the results after 100K steps with the default ig
(i.e. ig not present in the mdin file).

The first table contains the ns/day estimates obtained for each molecular
system on each TITAN GPU. Interestingly, the estimates obtained for the
same system in different rounds differ slightly, but maybe that's OK.
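
To put a number on "differ slightly", the round-to-round spread can be
computed directly from the ns/day table. A minimal Python sketch, assuming
the values are typed in by hand from the attached table (the dictionary
layout and the helper name round_spread are illustrative only, not part of
Amber or its benchmark suite):

```python
# ns/day values copied from the attached table; None marks a crashed run (ERR).
NS_PER_DAY = {
    "GPU_0": {
        "JAC_NVE": (115.91, 109.41),
        "JAC_NPT": (None, 85.73),
        "FACTOR_IX_NVE": (30.56, 30.27),
        "FACTOR_IX_NPT": (25.01, 24.95),
        "CELLULOSE_NVE": (None, None),
        "TRPCAGE": (595.09, 623.96),
        "MYOGLOBIN": (202.56, 201.16),
        "NUCLEOSOME": (3.45, 3.45),
    },
    "GPU_1": {
        "JAC_NVE": (114.92, 106.44),
        "JAC_NPT": (85.97, 83.63),
        "FACTOR_IX_NVE": (29.85, 29.63),
        "FACTOR_IX_NPT": (24.56, 24.43),
        "CELLULOSE_NVE": (None, 7.05),
        "TRPCAGE": (599.20, 585.14),
        "MYOGLOBIN": (195.91, 197.48),
        "NUCLEOSOME": (3.40, 3.40),
    },
}


def round_spread(r1, r2):
    """Relative difference between two rounds, or None if either run crashed."""
    if r1 is None or r2 is None:
        return None
    return abs(r1 - r2) / max(r1, r2)


for gpu, systems in NS_PER_DAY.items():
    for name, (r1, r2) in systems.items():
        spread = round_spread(r1, r2)
        label = "n/a (crash)" if spread is None else f"{spread:.1%}"
        print(f"{gpu}  {name:14s} {label}")
```

This only measures timing noise between rounds; it says nothing about
whether the trajectories themselves were identical.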

The second table contains the total energy after 100k steps, to check
the reproducibility of the results.

Here is a summary:

#1 - simulation crashes on TITANs

Interestingly, there was just one simulation crash in JAC_NPT (TITAN_0,
ROUND_1); the remaining 3 TITAN JAC_NPT simulations finished. There were
also 3 crashes in the CELLULOSE_NVE test, but the last simulation
(TITAN_1, ROUND_2) finished without any problem. All the remaining
simulations always finished without any problem. So the simulation
crashes seem to be non-reproducible/unpredictable on some molecular
systems/(mdin setups).

CRASH ERRORS:

a) JAC_NPT (TITAN_0, ROUND_1)
Here 11k steps were completed successfully before the crash. I found
this error in the mdout file:

| ERROR: max pairlist cutoff must be less than unit cell max sphere radius!

b) CELLULOSE_NVE (TITAN_0, ROUND_1, ROUND_2; TITAN_1, ROUND_1 )
Here I did not find any error in the mdout file. Only this error was
written to standard output (screen/nohup.out file):

------
Error: unspecified launch failure launching kernel kNLSkinTest
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
grep: mdinfo.1GTX_TITAN: No such file or directory
-----

in all three cases.

Here, in the CELLULOSE_NVE case, I started to play with the NTPR
parameter (originally just on the TITAN-0 GPU) to see how many steps
were completed successfully before the crash. This small investigation
then turned out to be more interesting than I ever expected :)) Below,
in chronological order, are my results for Etot after 2000 steps on
different GPUs (machines). I repeated the calculation several times for
each NTPR value just to be sure.

TITAN-0, Etot after 2000 steps

NTPR=10

  -443256.6867
  -443256.6867
  -443256.6867

NTPR=100

  -443250.1350
  -443250.1350
  -443250.1350

NTPR=200

  -443261.0705
  -443261.0705
  -443072.3097
  -443261.0705
  -443261.0705
  -443261.0705
  -443261.0705

NTPR=10 (again just to verify)

-443256.6867
-443256.6867


Then I tried with TITAN-1

NTPR=10

-443256.6867
-443256.6867

NTPR=100

-443250.1350
-443250.1350

NTPR=200

-443261.0705
-443261.0705


Then I tried with GTX-580

NTPR=10

-443256.6867
-443256.6867

NTPR=200

-443261.0705
-443261.0705

then I tried with GTX-680

NTPR=10 Etot after 2000 steps

  -443256.6711
  -443256.6711

NTPR=200 Etot after 2000 steps

-443261.0705
-443261.0705

Any idea why the energies should depend on the frequency of the energy
records (NTPR)?
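
The TITAN-0 lists above can be checked mechanically: group the runs by
NTPR, take the most common Etot as the reference, and flag any run that
deviates. A minimal Python sketch, assuming the values are copied by hand
from this message (the names ETOT_TITAN_0 and split_majority are
illustrative only). The energies are kept as strings on purpose, because
the test here is whether the printed output is identical, not whether it
is numerically close:

```python
from collections import Counter

# Etot after 2000 steps on TITAN-0, copied from the lists above.
# The two "again just to verify" runs are appended to the NTPR=10 entry.
ETOT_TITAN_0 = {
    10: ["-443256.6867"] * 5,
    100: ["-443250.1350"] * 3,
    200: [
        "-443261.0705", "-443261.0705", "-443072.3097", "-443261.0705",
        "-443261.0705", "-443261.0705", "-443261.0705",
    ],
}


def split_majority(values):
    """Return (most common value, list of runs that deviate from it)."""
    majority, _count = Counter(values).most_common(1)[0]
    outliers = [v for v in values if v != majority]
    return majority, outliers


for ntpr, values in ETOT_TITAN_0.items():
    majority, outliers = split_majority(values)
    print(f"NTPR={ntpr:4d}  reference Etot={majority}  deviating runs={outliers}")
```

This only tabulates the two effects seen above (a different reference
energy for each NTPR, and one deviating run at NTPR=200); it does not
explain either of them.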



#2 - reproducibility on TITANs (see attached table.txt)

Here, too, there are differences depending on the particular
system/setup. For the FACTOR_IX_NVE, FACTOR_IX_NPT, TRPCAGE and MYOGLOBIN
systems I obtained 100% reproducibility (the results for a given system
were identical on both cards in all ROUNDs), whereas for JAC_NVE, JAC_NPT
and NUCLEOSOME I generally obtained small differences; on the TITAN_1
GPU, however, the NUCLEOSOME results were also 100% reproducible.
Moreover, for the TITAN_1 card, which managed to finish the CELLULOSE
test at least in ROUND_2, I ran an additional third round and got a
result identical to that of ROUND_2 (i.e. -443246.3206). So regarding
the TITAN_1 GPU I can say that it reproduces the 100k-step CELLULOSE_NVE
result 100%, perhaps on all runs that eventually finish successfully :))
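
The per-system verdicts above can be derived from the energy table with a
simple rule: a system counts as reproducible on a card only if at least
two runs finished and all finished runs printed the same Etot. A minimal
Python sketch, assuming the TITAN rows are copied by hand from the
attached table (ETOT_100K and classify are illustrative names; the third
CELLULOSE round on TITAN_1 is not in the table, so it is not included):

```python
# Etot at step 100000 for ROUND_1 and ROUND_2, copied from the attached table.
ETOT_100K = {
    "TITAN_0": {
        "JAC_NVE": ("-58137.8526", "-58140.5142"),
        "JAC_NPT": ("ERR", "-58159.9873"),
        "FACTOR_IX_NVE": ("-234189.5802", "-234189.5802"),
        "FACTOR_IX_NPT": ("-234370.3688", "-234370.3688"),
        "CELLULOSE_NVE": ("ERR", "ERR"),
        "TRPCAGE": ("-238.0523", "-238.0523"),
        "MYOGLOBIN": ("-1429.6137", "-1429.6137"),
        "NUCLEOSOME": ("-66858.7444", "-66792.2804"),
    },
    "TITAN_1": {
        "JAC_NVE": ("-58139.8792", "-58141.8652"),
        "JAC_NPT": ("-58147.8714", "-58150.9792"),
        "FACTOR_IX_NVE": ("-234189.5802", "-234189.5802"),
        "FACTOR_IX_NPT": ("-234370.3688", "-234370.3688"),
        "CELLULOSE_NVE": ("ERR", "-443246.3206"),
        "TRPCAGE": ("-238.0523", "-238.0523"),
        "MYOGLOBIN": ("-1429.6137", "-1429.6137"),
        "NUCLEOSOME": ("-66858.7444", "-66858.7444"),
    },
}


def classify(rounds):
    """Classify one system on one card from its per-round Etot strings."""
    finished = [r for r in rounds if r != "ERR"]
    if len(finished) < 2:
        return "untested"  # fewer than two finished runs to compare
    return "reproducible" if len(set(finished)) == 1 else "differs"


for card, systems in ETOT_100K.items():
    for name, rounds in systems.items():
        print(f"{card}  {name:14s} {classify(rounds)}")
```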


#3 - GTX-580, GTX-680 controls

Here the simulations ran without any problems and were 100% reproducible
on each card. However, the results for a given system differ slightly
between the two cards, with the exception of the CELLULOSE system, where
both cards (GTX-580, GTX-680) gave an identical result, which is moreover
nearly identical to the result obtained with TITAN_1 during ROUND_2
(relative difference of about 7e-8, going by the values in the attached
table).
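
The relative difference can be recomputed from the table values. A
minimal Python sketch (the helper name relative_difference is
illustrative only); with the CELLULOSE_NVE energies from the attached
table it comes out at roughly 7e-8:

```python
def relative_difference(a, b):
    """Absolute difference scaled by the larger magnitude of the two values."""
    return abs(a - b) / max(abs(a), abs(b))


etot_titan_1 = -443246.3206  # TITAN_1, ROUND_2, CELLULOSE_NVE
etot_gtx = -443246.3519      # GTX_580 and GTX_680, both rounds

rel = relative_difference(etot_titan_1, etot_gtx)
print(f"absolute difference: {abs(etot_titan_1 - etot_gtx):.4f}")
print(f"relative difference: {rel:.1e}")
```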


TO ET:
a)
I had no problems with the minimization stages in my own simulations
larger than 100k atoms, which crashed during the NVT heating phase.

b)
313.30 driver ??? OK, so after 319.23 I will try experimenting with this
slightly "outdated" version :))
At the moment I am working with 319.17 (and CUDA 5.0).

c)
Could you please run at least the JAC_NPT, JAC_NVE, NUCLEOSOME and
CELLULOSE_NVE tests with 100 000 steps (same random seed, e.g. the
default, i.e. ig deleted from mdin if it is there) twice, to confirm 100%
reproducibility on your TITAN GPU?
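
For comparing two such runs, the Etot at a given step can be pulled out
of both mdout files and compared as printed. A minimal Python sketch,
assuming the mdout energy records look like the JAC_NPT record quoted
further down in this thread ("NSTEP = ..." followed by "Etot = ..."). It
keeps only the first record seen for each step, on the assumption that
the averages and fluctuations printed at the end of mdout come after the
per-step records. The function names and file names are illustrative
only:

```python
import re

NSTEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")
ETOT_RE = re.compile(r"Etot\s*=\s*(-?\d+\.\d+)")


def etot_by_step(mdout_text):
    """Map NSTEP -> Etot (as printed), keeping the first record per step."""
    energies = {}
    step = None
    for line in mdout_text.splitlines():
        match = NSTEP_RE.search(line)
        if match:
            step = int(match.group(1))
        match = ETOT_RE.search(line)
        if match and step is not None:
            energies.setdefault(step, match.group(1))
    return energies


def identical_at_step(mdout_a, mdout_b, step):
    """True only if both runs printed an Etot at `step` and the strings match."""
    a = etot_by_step(mdout_a).get(step)
    b = etot_by_step(mdout_b).get(step)
    return a is not None and a == b


# Hypothetical usage with two output files from repeated runs:
# with open("mdout.round1") as f1, open("mdout.round2") as f2:
#     print(identical_at_step(f1.read(), f2.read(), 100000))
```

Comparing the printed strings rather than parsed floats keeps the check
strict: any difference in the last printed digit counts as a mismatch.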

TO Divi:

Splitting the whole simulation into many subtrajectories is also my usual
approach (in my case 0.5 ns = 250k steps of 2 fs), but it does not seem
to help here by itself. Could you please also run the same tests that I
asked ET for (point c)?
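
For reference, the step counts and segment lengths mentioned in this
thread convert as follows. A minimal Python sketch (simulated_ns is an
illustrative helper, not an Amber function):

```python
def simulated_ns(steps, dt_fs=2.0):
    """Simulated time in ns for `steps` MD steps with a time step of dt_fs fs."""
    return steps * dt_fs / 1e6  # 1 ns = 1e6 fs


print(simulated_ns(250_000))  # one of my subtrajectories
print(simulated_ns(500_000))  # the segment length Divi suggested
print(simulated_ns(100_000))  # one benchmark run of 100k steps
```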


BTW the CUDA 5.5 release candidate was just released
( https://developer.nvidia.com/cuda-toolkit ).
Would it be a reasonable idea to try compiling/running pmemd.cuda with
this brand-new CUDA version?

   Thanks !

        Best wishes,

                  Marek






On Wed, 29 May 2013 03:44:33 +0200, Ross Walker <ross.rosswalker.co.uk>
wrote:

> Hi All,
>
> Just an update that we will have some fixes out soon that address some
> errors we have been noticing with simulations crashing during NPT runs.
> It
> is possible that this is confusing the issue here as to whether the
> problem is related to the GTX Titan or to a possible bug in the code. I
> hope to have the patch released within a few days at which point it would
> be good to repeat these tests and then hopefully we can try to track down
> what is going on. I find it hard to believe that so many cards are faulty
> so I suspect that there may be something funky in the code with regards
> to
> GTX Titans. We'll try and get it fixed as soon as possible but for now
> please just wait until we get the update released for AMBER 12 in a few
> days and see if that helps at all.
>
> All the best
> Ross
>
>
> On 5/28/13 5:12 PM, "Divi/GMAIL" <dvenkatlu.gmail.com> wrote:
>
>> I have two TITANs in my Gigabyte workstation. I have had similar
>> issues
>> of NANs for some of the simulation setups. Never could figure out why
>> the
>> simulations failed for no reason. I tried 10, 12 ang. box sizes. same
>> random breakdowns. Thought of returning them suspecting memory errors.
>> But
>> some simulations ran perfectly fine. Currently running two calculations
>> without any problems. Both are running pretty stable for over 100ns. I
>> suspect AMBER CUDA code may have some issues under some simulation
>> conditions such as NPT. In general, NVT setup is more successful than
>> NPT,
>> in my case.
>>
>> These are 287426 atoms simulation on one card (9 ns/day)
>> On other card: 129049 atom setup (20 ns/day)
>>
>> Both using same NVT setup. (AMBER12/INTEL-12.x
>> compilers/CentOS-6.3/Drivers 319.17/CUDA5.0)
>>
>> Input is below:
>> &cntrl
>> nstlim=500000, dt=0.002,
>> ntx=5, irest=1, ig=-1,
>> ntpr=1000, ntwr=10000, ntwx=10000,
>> ntt=1, tautp=2, ntb=1, ntp=0, ntc=2, ntf=2,
>> iwrap=1, ioutfm=1, ntxo=2,
>> &end
>>
>> One suggestion If I may add: If you could run short simulations for no
>> more
>> than 500,000 steps (or 1ns with 2 fs), you might find some stability.
>> Again,
>> not scientific rationale from my side. But it worked in some cases for
>> me.
>>
>> This is self-assembled system with GIGABYTE GA-Z77X-UP7 (with core i5
>> processor) and 1200W PS/16GB memory.
>>
>>
>> Best regards
>> Divi
>>
>>
>>
>> -----Original Message-----
>> From: Scott Le Grand
>> Sent: Tuesday, May 28, 2013 4:46 PM
>> To: AMBER Mailing List
>> Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
>> memtestG80 - UNDERclocking in Linux ?
>>
>> You can play Russian Roulette a whole bunch of rounds without blowing
>> your
>> head off.
>>
>> Similarly, when you have a GPU that occasionally flips a bit the wrong
>> way,
>> most of the time it will be some low order perturbation to the
>> coordinates
>> that does little more than make the trajectory nondeterministic...
>> Except
>> when it doesn't...
>>
>> You can't even detect this kind of misbehavior in GROMACS, ACEMD, or
>> NAMD
>> because *none* of them (to my knowledge) are capable of producing
>> deterministic output at production-level performance.
>>
>> Titans and 680s are consumer cards. I love them to death, but if you're
>> going to do production work with them, you need to qual them thoroughly
>> before proceeding or you need to pay up and use Teslas instead. I'd
>> still
>> build a cluster with Titans myself, but I'd ruthlessly RMA them until I
>> got
>> satisfaction if they couldn't pass a test consisting of running an AMBER
>> simulation for 100K iterations without either crashing or producing a
>> nondeterministic result. The customer is always right.
>>
>>
>> On Tue, May 28, 2013 at 1:20 PM, Marek Maly <marek.maly.ujep.cz> wrote:
>>
>>> I would wait for the results of my GPU0, GPU1 double tests before
>>> any serious conclusions.
>>>
>>> BTW what exactly means "GPU is hosed" ? Something like GPU is damaged
>>> or
>>> so ?
>>>
>>> Also would be strange (not probable) to buy 2 somehow damaged GPUs
>>> (even
>>> in the same way).
>>>
>>> As I wrote, memtestG80 tests were negative on both cards, if moreover
>>> both cards will perfectly reproduce both repetitions of the Amber
>>> benchmarks
>>> and eventually pass some another GPU tests (can you recommend any
>>> except
>>> memtestG80 ?)
>>> I still believe that the GPU cards are OK (also thank to particular
>>> successes in my Amb. simulations and actual A. benchmarks). So maybe I
>>> will eventually try downclock, but there might be some another
>>> variables,
>>> e.g. driver, OS, motherboard (I will perhaps test one card in another
>>> MB
>>> just to be sure, that problem is not MB based) etc. that's why I asked
>>> before that guy "ET" for the info about driver version, would be also
>>> interesting OS info or MB.
>>>
>>> M.
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 28 May 2013 22:13:36 +0200, Scott Le Grand
>>> <varelse2005.gmail.com> wrote:
>>>
>>> > Marek,
>>> > Your GPU is hosed. I don't have anything else to add. I'm not going
>>> to
>>> > go
>>> > snark hunting for a bug that doesn't exist.
>>> >
>>> >
>>> >
>>> > On Tue, May 28, 2013 at 12:24 PM, Marek Maly <marek.maly.ujep.cz>
>>> wrote:
>>> >
>>> >> Hi, just for the curiosity which driver are you using
>>> >> on that machine with perfectly working with OC TITAN,
>>> >> 319.17 or some more actual e.g. 319.23 ?
>>> >>
>>> >> RMA is a good idea but it could be also long time story and
>>> >> also to succeed here you need to have strong arguments
>>> >> especially if you are going to RMA two OC TITANs.
>>> >>
>>> >> I am not sure if my arguments "The cards have problems with some
>>> Amber
>>> >> calculations"
>>> >> would be strong enough here. Would be much better to have clear
>>> results
>>> >> from
>>> >> respected GPU tests and as it seems you may do extensive GPU tests
>>> also
>>> >> with
>>> >> multiple routines without any errors but still have problems with
>>> >> particular
>>> >> Amber simulations...
>>> >>
>>> >> BTW I am now doing Amber benchmarks with nstlim=100K and ig=default
>>> for
>>> >> each card
>>> >> twice. The tests will be done in cca 3 hours (due to slow nucleosome
>>> GB
>>> >> test).
>>> >>
>>> >> But even now I have interesting results from the first test on GPU0
>>> >> (nucleosome is still running) see below.
>>> >>
>>> >> As you can see JAC_NPT crashed around 11000 step, here is the last
>>> >> md.out
>>> >> record:
>>> >>
>>> >> *********
>>> >>
>>> >>
>>> >> ------------------------------------------------------------------------------
>>> >>
>>> >> check COM velocity, temp: 0.000021 0.00(Removed)
>>> >>
>>> >> NSTEP = 11000 TIME(PS) = 28.000 TEMP(K) = 300.39 PRESS = -9.4
>>> >> Etot = -58092.8958 EKtot = 14440.2520 EPtot = -72533.1478
>>> >> BOND = 443.3912 ANGLE = 1253.5177 DIHED = 970.1275
>>> >> 1-4 NB = 567.2497 1-4 EEL = 6586.9007 VDWAALS = 8664.9960
>>> >> EELEC = -91019.3306 EHBOND = 0.0000 RESTRAINT = 0.0000
>>> >> EKCMT = 6274.0354 VIRIAL = 6321.9969 VOLUME = 236141.9494
>>> >> Density = 1.0162
>>> >>
>>> >> ------------------------------------------------------------------------------
>>> >>
>>> >> | ERROR: max pairlist cutoff must be less than unit cell max sphere radius!
>>> >>
>>> >> ********
>>> >>
>>> >> Any idea about that ERROR ?
>>> >>
>>> >> On the other hand FACTOR_IX_NPT which has much more atoms passed
>>> >> without
>>> >> any issue.
>>> >>
>>> >> Cellulose crashed on the beginning without any ERROR message in
>>> md.out
>>> >> file.
>>> >>
>>> >>
>>> >> I am very curious regarding exact reproducibility of the results at
>>> >> least
>>> >> in the
>>> >> framework of both tests on individual cards.
>>> >>
>>> >> BTW regarding eventual downclocking, has anyone idea about some
>>> NVclock
>>> >> alternative or
>>> >> I will be really eventually forced to edit frequency value in GPU
>>> BIOS
>>> >> ?
>>> >>
>>> >> Best,
>>> >>
>>> >> Marek
>>> >>
>>> >> HERE ARE THE FIRST DATA FROM MY 2x2 Bench tests
>>> >>
>>> >> JAC_PRODUCTION_NVE - 23,558 atoms PME
>>> >> -------------------------------------
>>> >>
>>> >> 1 x GTX_TITAN: | ns/day = 115.91 seconds/ns = 745.39
>>> >>
>>> >> JAC_PRODUCTION_NPT - 23,558 atoms PME
>>> >> -------------------------------------
>>> >>
>>> >> 1 x GTX_TITAN: STOP PMEMD Terminated Abnormally!
>>> >> | ns/day = 90.72 seconds/ns = 952.42
>>> >>
>>> >> FACTOR_IX_PRODUCTION_NVE - 90,906 atoms PME
>>> >> -------------------------------------------
>>> >>
>>> >> 1 x GTX_TITAN: | ns/day = 30.56 seconds/ns = 2827.33
>>> >>
>>> >> FACTOR_IX_PRODUCTION_NPT - 90,906 atoms PME
>>> >> -------------------------------------------
>>> >>
>>> >> 1 x GTX_TITAN: | ns/day = 25.01 seconds/ns = 3454.56
>>> >>
>>> >> CELLULOSE_PRODUCTION_NVE - 408,609 atoms PME
>>> >> --------------------------------------------
>>> >>
>>> >> 1 x GTX_TITAN: Error: unspecified launch failure launching kernel kNLSkinTest
>>> >> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
>>> >> grep: mdinfo.1GTX_TITAN: No such file or directory
>>> >>
>>> >> TRPCAGE_PRODUCTION - 304 atoms GB
>>> >> ---------------------------------
>>> >> 1 x GTX_TITAN: | ns/day = 595.09 seconds/ns = 145.19
>>> >>
>>> >> MYOGLOBIN_PRODUCTION - 2,492 atoms GB
>>> >> -------------------------------------
>>> >>
>>> >> 1 x GTX_TITAN: | ns/day = 202.56 seconds/ns = 426.53
>>> >>
>>> >> NUCLEOSOME_PRODUCTION - 25,095 atoms GB
>>> >> ---------------------------------------
>>> >>
>>> >> 1 x GTX_TITAN:
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> On Tue, 28 May 2013 20:42:32 +0200, ET <sketchfoot.gmail.com> wrote:
>>> >>
>>> >> > Hi,
>>> >> >
>>> >> > I just got a superclocked Titan and one at normal freq. The first
>>> one
>>> >> ran
>>> >> > like a charm with no issues so far. The other standard clocked one
>>> >> could
>>> >> > never get past the constant pressure stage in an NPT simulation.
>>> It
>>> >> kept
>>> >> > writing NAN or ********* in the outfile. I swapped them about in
>>> the
>>> >> pcie
>>> >> > lanes then ran it solo in each one of the lanes. Despite all this
>>> it
>>> >> was
>>> >> > still failing the benchmark that the other one had no problems
>>> with.
>>> >> >
>>> >> > I couldn't find any memory errors with GPU-burn either, but as
>>> they
>>> >> cost
>>> >> > near a grand a piece, I RMA'd it today. I recommend you to do the
>>> >> same if
>>> >> > its not giving you any joy. Life's too short. :)
>>> >> >
>>> >> > br,
>>> >> > g
>>> >> >
>>> >> >
>>> >> > On 28 May 2013 16:57, Scott Le Grand <varelse2005.gmail.com>
>>> wrote:
>>> >> >
>>> >> >> AMBER != NAMD...
>>> >> >>
>>> >> >> GTX 680 != GTX Titan...
>>> >> >>
>>> >> >> Ian's suggestion is a good one. But even then, you need to test
>>> >> >> your
>>> >> >> GPUs
>>> >> >> as the Titans are running right on the edge of stability. Like I
>>> >> told
>>> >> >> Marek, try running 100K iterations of Cellulose NVE twice with
>>> the
>>> >> same
>>> >> >> random seed. if you don't get identically bit accurate output,
>>> your
>>> >> >> GPU is
>>> >> >> not working. Memtest programs do not catch this because (I am
>>> >> guessing)
>>> >> >> they are designed for a uniform memory hierarchy and only one
>>> path
>>> >> >> to
>>> >> >> read
>>> >> >> and write data. I have a stock GTX Titan that cannot pass the
>>> >> Cellulose
>>> >> >> NVE test and another one that does. I spent a couple days on the
>>> >> former
>>> >> >> GPU looking for the imaginary bug that went away like magic the
>>> >> second I
>>> >> >> switched out the GPU.
>>> >> >>
>>> >> >> Scott
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> On Tue, May 28, 2013 at 8:11 AM, Robert Konecny <rok.ucsd.edu>
>>> wrote:
>>> >> >>
>>> >> >> > Hi Scott,
>>> >> >> >
>>> >> >> > unfortunately we are seeing similar Amber instability on GTX
>>> >> Titans as
>>> >> >> > Marek is. We have a box with four GTX Titans (not overclocked)
>>> >> running
>>> >> >> > CentOS 6.3 with NVidia 319.17 driver and Amber 12.2. Any Amber
>>> >> >> simulation
>>> >> >> > longer than 10-15 min eventually crashes on these cards,
>>> including
>>> >> >> both
>>> >> >> JAC
>>> >> >> > benchmarks (with extended run time). This is reproducible on
>>> all
>>> >> four
>>> >> >> > cards.
>>> >> >> >
>>> >> >> > To eliminate the possible hardware error we ran extended GPU
>>> >> >> > memory
>>> >> >> tests
>>> >> >> > on all four Titans with memtestG80, cuda_memtest and also
>>> gpu_burn
>>> >> -
>>> >> >> all
>>> >> >> > finished without errors. Since I agree that these programs may
>>> not
>>> >> >> test
>>> >> >> the
>>> >> >> > GPU completely we also set up simulations with NAMD. We can run
>>> >> four
>>> >> >> NAMD
>>> >> >> > simulations simultaneously for many days without any errors on
>>> >> >> > this
>>> >> >> > hardware. For reference - we also have exactly the same server
>>> >> >> > with
>>> >> >> the
>>> >> >> > same hardware components but with four GTX680s and this setup
>>> >> >> > works
>>> >> >> just
>>> >> >> > fine for Amber. So all this leads me to believe that a hardware
>>> >> error
>>> >> >> is
>>> >> >> > not very likely.
>>> >> >> >
>>> >> >> > I would appreciate your comments on this, perhaps there is
>>> >> something
>>> >> >> else
>>> >> >> > causing these errors which we are not seeing.
>>> >> >> >
>>> >> >> > Thanks,
>>> >> >> >
>>> >> >> > Robert
>>> >> >> >
>>> >> >> >
>>> >> >> > On Mon, May 27, 2013 at 04:25:24PM -0700, Scott Le Grand wrote:
>>> >> >> > > I have two GTX Titans. One is defective, the other is not.
>>> >> >> > Unfortunately,
>>> >> >> > > they both pass all standard GPU memory tests.
>>> >> >> > >
>>> >> >> > > What the defective one doesn't do is generate reproducibly
>>> >> >> bit-accurate
>>> >> >> > > outputs for simulations of Factor IX (90,906 atoms) or
>>> larger,
>>> >> >> > > of
>>> >> >> 100K
>>> >> >> or
>>> >> >> > > so iterations.
>>> >> >> > >
>>> >> >> > > Which is yet another reason why I insist on MD algorithms
>>> >> >> (especially
>>> >> >> on
>>> >> >> > > GPUS) being deterministic. Besides its ability to find
>>> software
>>> >> >> bugs,
>>> >> >> > and
>>> >> >> > > fulfilling one of the most important tenets of science, it's
>>> a
>>> >> great
>>> >> >> way
>>> >> >> > to
>>> >> >> > > diagnose defective hardware with very little effort.
>>> >> >> > >
>>> >> >> > > 928 MHz? That's 6% above the boost clock of a stock Titan.
>>> >> Titan
>>> >> >> is
>>> >> >> > > pushing the performance envelope as is. If you're going to
>>> pay
>>> >> the
>>> >> >> > premium
>>> >> >> > > for such chips, I'd send them back until you get one that
>>> runs
>>> >> >> correctly.
>>> >> >> > > I'm very curious how fast you can push one of these things
>>> >> >> > > before
>>> >> >> they
>>> >> >> > give
>>> >> >> > > out.
>>> >> >> > >
>>> >> >> > >
>>> >> >> > >
>>> >> >> > >
>>> >> >> > >
>>> >> >> > >
>>> >> >> > >
>>> >> >> > > On Mon, May 27, 2013 at 10:01 AM, Marek Maly
>>> <marek.maly.ujep.cz
>>> >
>>> >> >> wrote:
>>> >> >> > >
>>> >> >> > > > Dear all,
>>> >> >> > > >
>>> >> >> > > > I have recently bought two "EVGA GTX TITAN Superclocked"
>>> GPUs.
>>> >> >> > > >
>>> >> >> > > > I did the first calculations (pmemd.cuda in Amber12) with
>>> >> systems
>>> >> >> > around
>>> >> >> > > > 60K atoms without any problems (NPT, Langevin), but when I
>>> >> later
>>> >> >> tried
>>> >> >> > > > with bigger systems (around 100K atoms) I obtained
>>> "classical"
>>> >> >> > irritating
>>> >> >> > > > errors
>>> >> >> > > >
>>> >> >> > > > cudaMemcpy GpuBuffer::Download failed unspecified launch
>>> >> failure
>>> >> >> > > >
>>> >> >> > > > just after few thousands of MD steps.
>>> >> >> > > >
>>> >> >> > > > So this was obviously the reason for memtestG80 tests.
>>> >> >> > > > ( https://simtk.org/home/memtest ).
>>> >> >> > > >
>>> >> >> > > > So I compiled memtestG80 from sources (
>>> >> memtestG80-1.1-src.tar.gz
>>> >> >> )
>>> >> >> and
>>> >> >> > > > then tested
>>> >> >> > > > just small part of memory GPU (200 MB) using 100
>>> iterations.
>>> >> >> > > >
>>> >> >> > > > On both cards I have obtained huge amount of errors but
>>> "just"
>>> >> on
>>> >> >> > > > "Random blocks:". 0 errors in all remaining tests in all
>>> >> >> iterations.
>>> >> >> > > >
>>> >> >> > > > ------THE LAST ITERATION AND FINAL RESULTS-------
>>> >> >> > > >
>>> >> >> > > > Test iteration 100 (GPU 0, 200 MiB): 169736847 errors so
>>> far
>>> >> >> > > > Moving Inversions (ones and zeros): 0 errors (6 ms)
>>> >> >> > > > Memtest86 Walking 8-bit: 0 errors (53 ms)
>>> >> >> > > > True Walking zeros (8-bit): 0 errors (26 ms)
>>> >> >> > > > True Walking ones (8-bit): 0 errors (26 ms)
>>> >> >> > > > Moving Inversions (random): 0 errors (6 ms)
>>> >> >> > > > Memtest86 Walking zeros (32-bit): 0 errors (105 ms)
>>> >> >> > > > Memtest86 Walking ones (32-bit): 0 errors (104 ms)
>>> >> >> > > > Random blocks: 1369863 errors (27 ms)
>>> >> >> > > > Memtest86 Modulo-20: 0 errors (215 ms)
>>> >> >> > > > Logic (one iteration): 0 errors (4 ms)
>>> >> >> > > > Logic (4 iterations): 0 errors (8 ms)
>>> >> >> > > > Logic (shared memory, one iteration): 0 errors (8
>>> ms)
>>> >> >> > > > Logic (shared-memory, 4 iterations): 0 errors (25
>>> ms)
>>> >> >> > > >
>>> >> >> > > > Final error count after 100 iterations over 200 MiB of GPU
>>> >> memory:
>>> >> >> > > > 171106710 errors
>>> >> >> > > >
>>> >> >> > > > ------------------------------------------
>>> >> >> > > >
>>> >> >> > > > I have some questions and would be really grateful for any
>>> >> >> comments.
>>> >> >> > > >
>>> >> >> > > > Regarding overclocking, using the deviceQuery I found out
>>> that
>>> >> >> under
>>> >> >> > linux
>>> >> >> > > > both cards run
>>> >> >> > > > automatically using boost shader/GPU frequency which is
>>> here
>>> >> 928
>>> >> >> MHz
>>> >> >> > (the
>>> >> >> > > > basic value for these factory OC cards is 876 MHz).
>>> >> >> > > > deviceQuery
>>> >> >> > reported
>>> >> >> > > > Memory Clock rate is 3004 MHz although "it" should be 6008
>>> MHz
>>> >> but
>>> >> >> > maybe
>>> >> >> > > > the quantity which is reported by deviceQuery "Memory Clock
>>> >> rate"
>>> >> >> is
>>> >> >> > > > different from the product specification "Memory Clock" .
>>> It
>>> >> seems
>>> >> >> that
>>> >> >> > > > "Memory Clock rate" = "Memory Clock"/2. Am I right ? Or
>>> just
>>> >> >> > deviceQuery
>>> >> >> > > > is not able to read this spec. properly
>>> >> >> > > > in Titan GPU ?
>>> >> >> > > >
>>> >> >> > > > Anyway for the moment I assume that the problem might be
>>> due
>>> >> >> > > > to
>>> >> >> the
>>> >> >> > high
>>> >> >> > > > shader/GPU frequency.
>>> >> >> > > > (see here :
>>> http://folding.stanford.edu/English/DownloadUtils)
>>> >> >> > > >
>>> >> >> > > > To verify this hypothesis one should perhaps UNDERclock to
>>> >> basic
>>> >> >> > frequency
>>> >> >> > > > which is in this
>>> >> >> > > > model 876 MHz or even to the TITAN REFERENCE frequency
>>> which
>>> >> >> > > > is
>>> >> >> 837
>>> >> >> > MHz.
>>> >> >> > > >
>>> >> >> > > > Obviously I am working with these cards under linux (CentOS
>>> >> >> > > > 2.6.32-358.6.1.el6.x86_64) and as I found, the OC tools
>>> under
>>> >> >> linux
>>> >> >> > are in
>>> >> >> > > > fact limited just to NVclock utility, which is
>>> unfortunately
>>> >> >> > > > out of date (at least speaking about the GTX Titan ). I
>>> have
>>> >> >> obtained
>>> >> >> > this
>>> >> >> > > > message when I wanted
>>> >> >> > > > just to let NVclock utility to read and print shader and
>>> >> >> > > > memory
>>> >> >> > > > frequencies of my Titan's:
>>> >> >> > > >
>>> >> >> > > >
>>> >> >>
>>> -------------------------------------------------------------------
>>> >> >> > > >
>>> >> >> > > > [root.dyn-138-272 NVCLOCK]# nvclock -s --speeds
>>> >> >> > > > Card: Unknown Nvidia card
>>> >> >> > > > Card number: 1
>>> >> >> > > > Memory clock: -2147483.750 MHz
>>> >> >> > > > GPU clock: -2147483.750 MHz
>>> >> >> > > >
>>> >> >> > > > Card: Unknown Nvidia card
>>> >> >> > > > Card number: 2
>>> >> >> > > > Memory clock: -2147483.750 MHz
>>> >> >> > > > GPU clock: -2147483.750 MHz
>>> >> >> > > >
>>> >> >> > > >
>>> >> >> > > >
>>> >> >>
>>> -------------------------------------------------------------------
>>> >> >> > > >
>>> >> >> > > >
>>> >> >> > > > I would be really grateful for some tips regarding
>>> "NVclock
>>> >> >> > alternatives",
>>> >> >> > > > but after wasting some hours with googling it seems that
>>> there
>>> >> is
>>> >> >> no
>>> >> >> > other
>>> >> >> > > > Linux
>>> >> >> > > > tool with NVclock functionality. So the only possibility is
>>> >> here
>>> >> >> > perhaps
>>> >> >> > > > to edit
>>> >> >> > > > GPU bios with some Lin/DOS/Win tools like (Kepler BIOS
>>> >> >> > > > Tweaker,
>>> >> >> > NVflash)
>>> >> >> > > > but obviously
>>> >> >> > > > I would like to rather avoid such approach as using it
>>> means
>>> >> >> perhaps
>>> >> >> > also
>>> >> >> > > > to void the warranty even if I am going to underclock the
>>> GPUs
>>> >> >> not to
>>> >> >> > > > overclock them.
>>> >> >> > > > So before this eventual step (GPU bios editing) I would
>>> like
>>> >> >> > > > to
>>> >> >> have
>>> >> >> > some
>>> >> >> > > > approximative estimate
>>> >> >> > > > of the probability, that the problems are here really
>>> because
>>> >> of
>>> >> >> the
>>> >> >> > > > overclocking
>>> >> >> > > > (too high (boost) default shader frequency).
>>> >> >> > > >
>>> >> >> > > > This probability I hope to estimate from the eventual
>>> >> responses of
>>> >> >> > another
>>> >> >> > > > Amber/Titan SC users, if I am not the only crazy guy who
>>> >> >> > > > bought
>>> >> >> this
>>> >> >> > model
>>> >> >> > > > for Amber calculations :)) But of course any eventual
>>> >> experiences
>>> >> >> with
>>> >> >> > > > Titan cards related to their memtestG80 results and
>>> >> >> UNDER/OVERclocking
>>> >> >> > > > (if possible in Linux OS) are of course welcomed as well !
>>> >> >> > > >
>>> >> >> > > > My HW/SW configuration
>>> >> >> > > >
>>> >> >> > > > motherboard: ASUS P9X79 PRO
>>> >> >> > > > CPU: Intel Core i7-3930K
>>> >> >> > > > RAM: CRUCIAL Ballistix Sport 32GB (4x8GB) DDR3 1600 VLP
>>> >> >> > > > CASE: CoolerMaster Dominator CM-690 II Advanced,
>>> >> >> > > > Power:Enermax PLATIMAX EPM1200EWT 1200W, 80+, Platinum
>>> >> >> > > > GPUs : 2 x EVGA GTX TITAN Superclocked 6GB
>>> >> >> > > > cooler: Cooler Master Hyper 412 SLIM
>>> >> >> > > >
>>> >> >> > > > OS: CentOS (2.6.32-358.6.1.el6.x86_64)
>>> >> >> > > > driver version: 319.17
>>> >> >> > > > cudatoolkit_5.0.35_linux_64_rhel6.x
>>> >> >> > > >
>>> >> >> > > > The computer is in air-conditioned room with permanent
>>> >> >> > > > external
>>> >> >> > > > temperature around 18°C
>>> >> >> > > >
>>> >> >> > > >
>>> >> >> > > > Thanks a lot in advance for any comment/experience !
>>> >> >> > > >
>>> >> >> > > > Best wishes,
>>> >> >> > > >
>>> >> >> > > > Marek
>>> >> >> > > >
>>> >> >> > > > --
>>> >> >> > > > This message was created with Opera's revolutionary e-mail client:
>>> >> >> > > > http://www.opera.com/mail/
>>> >> >> > > >
>>> >> >> > > > _______________________________________________
>>> >> >> > > > AMBER mailing list
>>> >> >> > > > AMBER.ambermd.org
>>> >> >> > > > http://lists.ambermd.org/mailman/listinfo/amber





TITANs - ns/day

GPU/ROUND   JAC_NVE  JAC_NPT  FACTOR_IX_NVE  FACTOR_IX_NPT  CELLULOSE_NVE  TRPCAGE  MYOGLOBIN  NUCLEOSOME
GPU_0
ROUND_1      115.91      ERR          30.56          25.01            ERR   595.09     202.56        3.45
ROUND_2      109.41    85.73          30.27          24.95            ERR   623.96     201.16        3.45
GPU_1
ROUND_1      114.92    85.97          29.85          24.56            ERR   599.20     195.91        3.40
ROUND_2      106.44    83.63          29.63          24.43           7.05   585.14     197.48        3.40


Total energy at step 100000

CARD/ROUND      JAC_NVE      JAC_NPT  FACTOR_IX_NVE  FACTOR_IX_NPT  CELLULOSE_NVE    TRPCAGE   MYOGLOBIN   NUCLEOSOME
*TITAN_0
ROUND_1     -58137.8526          ERR   -234189.5802   -234370.3688            ERR  -238.0523  -1429.6137  -66858.7444
ROUND_2     -58140.5142  -58159.9873   -234189.5802   -234370.3688            ERR  -238.0523  -1429.6137  -66792.2804
*TITAN_1
ROUND_1     -58139.8792  -58147.8714   -234189.5802   -234370.3688            ERR  -238.0523  -1429.6137  -66858.7444
ROUND_2     -58141.8652  -58150.9792   -234189.5802   -234370.3688   -443246.3206  -238.0523  -1429.6137  -66858.7444
*GTX_680
ROUND_1     -58139.7224  -58190.8157   -234184.6576   -234360.2490   -443246.3519  -238.0523  -1429.6137  -66841.1887
ROUND_2     -58139.7224  -58190.8157   -234184.6576   -234360.2490   -443246.3519  -238.0523  -1429.6137  -66841.1887
*GTX_580
ROUND_1     -58139.8773  -58158.3432   -234186.3908   -234391.0005   -443246.3519  -242.7692  -1366.9785  -66801.3274
ROUND_2     -58139.8773  -58158.3432   -234186.3908   -234391.0005   -443246.3519  -242.7692  -1366.9785  -66801.3274







_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed May 29 2013 - 13:30:02 PDT