Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Scott Le Grand <varelse2005.gmail.com>
Date: Wed, 29 May 2013 13:46:58 -0700

PS: Try running for 100K steps before comparing energies; I suspect no
two simulations will match.
On May 29, 2013 1:41 PM, "Scott Le Grand" <varelse2005.gmail.com> wrote:

> Your Titan setup is hosed. Your results were not 100% deterministic for
> the same inputs.
>
> Energies + Forces use a different subroutine than Forces alone, hence the
> ntpr dependence. Changing ntpr therefore effectively changes the input.
>
> It's 100% ironclad reproducibility that matters and you demonstrated it's
> not happening.
> On May 29, 2013 1:30 PM, "Marek Maly" <marek.maly.ujep.cz> wrote:
>
>> Hi all,
>>
>> First of all, thanks to Ross for his update! Although it remains to be seen
>> whether it will solve all the reported Amber issues with Titan/OC Titan
>> GPUs. So let's see and hope :))
>>
>> Here are my results - see the attached TXT file with tables summarising
>> the test results. I ran the same Amber benchmark tests twice on each GPU
>> (both Titans, the GTX 680 and the GTX 580) to check the reproducibility of
>> the results after 100K steps with ig at its default value
>> (i.e. ig not present in the mdin file).
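>>
>> For illustration, the double runs can be driven by a small script along
>> these lines (the directory and file names are only placeholders for my
>> local benchmark layout, not part of the official suite):
>>
>> #!/bin/bash
>> # Run each benchmark system twice with identical input on one selected GPU,
>> # keeping both mdout files so the final energies can be compared afterwards.
>> export CUDA_VISIBLE_DEVICES=0        # TITAN_0; set to 1 for TITAN_1, etc.
>> for sys in JAC_NVE JAC_NPT FACTOR_IX_NVE FACTOR_IX_NPT CELLULOSE_NVE NUCLEOSOME; do
>>   cd "$sys" || continue
>>   for round in 1 2; do
>>     $AMBERHOME/bin/pmemd.cuda -O -i mdin -p prmtop -c inpcrd \
>>         -o mdout.round${round} -r restrt.round${round} -inf mdinfo.round${round}
>>   done
>>   cd ..
>> done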
>>
>> The first table contains the ns/day estimates obtained for each molecular
>> system on each TITAN GPU. Interestingly, the estimates obtained for the
>> same system in different rounds differ slightly, but maybe that's OK.
>>
>> The second table lists the total energy after 100K steps, to check the
>> reproducibility of the results.
>>
>> Here is a summary:
>>
>> #1 - simulation crashes on TITANs
>>
>> Interestingly, there was just one crash in JAC_NPT (TITAN_0, ROUND_1); the
>> remaining 3 TITAN JAC_NPT simulations finished. There were also 3 crashes
>> in the CELLULOSE_NVE test, but the last simulation (TITAN_1, ROUND_2)
>> finished without any problem. All the remaining simulations always finished
>> without any problem. So the crashes seem to be
>> non-reproducible/unpredictable for certain molecular systems/(mdin setups).
>>
>> CRASH ERRORS:
>>
>> a) JAC_NPT (TITAN_0, ROUND_1)
>> Here 11K steps completed successfully before the crash; I found this error
>> in the mdout file:
>>
>> | ERROR: max pairlist cutoff must be less than unit cell max sphere radius!
>>
>> b) CELLULOSE_NVE (TITAN_0, ROUND_1, ROUND_2; TITAN_1, ROUND_1 )
>> Here I did not find any error in the mdout file; only this error was
>> written to standard output (screen/nohup.out):
>>
>> ------
>> Error: unspecified launch failure launching kernel kNLSkinTest
>> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
>> grep: mdinfo.1GTX_TITAN: No such file or directory
>> -----
>>
>> in all three cases.
>>
>> For the CELLULOSE_NVE case I then started to play with the NTPR parameter
>> (originally just on the TITAN-0 GPU) to see how many steps completed before
>> the crash, and this little investigation turned out to be more interesting
>> than I ever expected :)) Below are, in chronological order, my results for
>> Etot after 2000 steps on the different GPUs (machines) - I repeated the
>> calculation several times for each NTPR value just to be sure.
>>
>> TITAN-0, Etot after 2000 steps
>>
>> NTPR=10
>>
>> -443256.6867
>> -443256.6867
>> -443256.6867
>>
>> NTPR=100
>>
>> -443250.1350
>> -443250.1350
>> -443250.1350
>>
>> NTPR=200
>>
>> -443261.0705
>> -443261.0705
>> -443072.3097
>> -443261.0705
>> -443261.0705
>> -443261.0705
>> -443261.0705
>>
>> NTPR=10 (again just to verify)
>>
>> -443256.6867
>> -443256.6867
>>
>>
>> Then I tried with TITAN-1
>>
>> NTPR=10
>>
>> -443256.6867
>> -443256.6867
>>
>> NTPR=100
>>
>> -443250.1350
>> -443250.1350
>>
>> NTPR=200
>>
>> -443261.0705
>> -443261.0705
>>
>>
>> Then I tried with GTX-580
>>
>> NTPR=10
>>
>> -443256.6867
>> -443256.6867
>>
>> NTPR=200
>>
>> -443261.0705
>> -443261.0705
>>
>> then I tried with GTX-680
>>
>> NTPR=10 Etot after 2000 steps
>>
>> -443256.6711
>> -443256.6711
>>
>> NTPR=200 Etot after 2000 steps
>>
>> -443261.0705
>> -443261.0705
>>
>> Any idea why the energies should depend on the frequency of the energy
>> records (NTPR)?
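>>
>> (For the record, the NTPR scan above can be automated roughly like this;
>> the sed edit assumes an explicit "ntpr=..." entry in the mdin file, and the
>> final-step Etot is normally the Etot line printed just before the
>> averages/fluctuations block at the end of mdout:)
>>
>> #!/bin/bash
>> # Repeat the 2000-step CELLULOSE_NVE run for several NTPR values and
>> # pull the Etot lines out of each mdout for comparison.
>> for ntpr in 10 100 200; do
>>   sed "s/ntpr=[0-9]*/ntpr=${ntpr}/" mdin > mdin.ntpr${ntpr}
>>   $AMBERHOME/bin/pmemd.cuda -O -i mdin.ntpr${ntpr} -p prmtop -c inpcrd \
>>       -o mdout.ntpr${ntpr} -r restrt.ntpr${ntpr}
>>   echo "NTPR=${ntpr}:"
>>   grep "Etot" mdout.ntpr${ntpr} | tail -3 | head -1
>> done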
>>
>>
>>
>> #2 - reproducibility on TITANs (see attached table.txt)
>>
>> Here, too, the outcome depends on the particular system/setup.
>> For the FACTOR_IX_NVE, FACTOR_IX_NPT, TRPCAGE and MYOGLOBIN systems I
>> obtained 100% reproducibility (the results for a given system were
>> identical for both cards/all rounds), while for JAC_NVE, JAC_NPT and
>> NUCLEOSOME I obtained small differences in general, although on the
>> TITAN_1 GPU the NUCLEOSOME results were also 100% reproducible. Moreover,
>> for the TITAN_1 card, which managed to finish the CELLULOSE test at least
>> in ROUND_2, I ran a 3rd additional round and got a result identical to
>> ROUND_2 (i.e. -443246.3206), so for the TITAN_1 GPU I can say that it
>> reproduces the 100K-step CELLULOSE_NVE result 100%, at least across the
>> runs that finish successfully :))
>>
>>
>> #3 - GTX-580, GTX-680 controls
>>
>> Here the simulations ran without any problems and were 100% reproducible
>> on each card; however, the results for a given system differ slightly
>> between the two cards, with the exception of the CELLULOSE system, where
>> both the GTX-580 and the GTX-680 gave an identical result which is,
>> moreover, nearly identical to the result obtained with TITAN_1 in ROUND_2
>> (relative difference 2e-6).
>>
>>
>> TO ET:
>> a)
>> I had no problems with the minimisation stages in my own simulations
>> (systems bigger than 100K atoms); those crashed during the NVT heating
>> phase.
>>
>> b)
>> The 313.30 driver??? OK, after 319.23 I will try experimenting with this
>> somewhat "outdated" version :)) At the moment I am running 319.17
>> (and CUDA 5.0).
>>
>> c)
>> Could you please run at least the JAC_NPT, JAC_NVE, NUCLEOSOME and
>> CELLULOSE_NVE tests with 100,000 steps (same random seed, e.g. the
>> default, i.e. ig deleted from mdin if it is there) twice, to confirm 100%
>> reproducibility on your TITAN GPU?
>>
>> TO Divi:
>>
>> Dividing the whole simulation into many sub-trajectories (in my case
>> 0.5 ns = 250K 2-fs steps) is also my usual approach, but by itself it does
>> not seem to help here. Could you please also run the same tests I asked ET
>> for (point c))?
>>
>>
>> BTW, the CUDA 5.5 release candidate has just been released
>> (https://developer.nvidia.com/cuda-toolkit); would it be a reasonable idea
>> to try to compile/run pmemd.cuda with this brand-new CUDA version?
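>>
>> (If anyone wants to try it, the rebuild is roughly the following - the
>> toolkit path is just an example, and whether AMBER 12 plus the current
>> bugfixes actually builds and runs correctly against the 5.5 RC is exactly
>> the open question:)
>>
>> export CUDA_HOME=/usr/local/cuda-5.5   # wherever the 5.5 RC was installed
>> export PATH=$CUDA_HOME/bin:$PATH
>> export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
>> cd $AMBERHOME
>> ./configure -cuda gnu                  # reconfigure against the new toolkit
>> make clean && make install             # rebuilds pmemd.cuda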
>>
>> Thanks !
>>
>> Best wishes,
>>
>> Marek
>>
>>
>>
>>
>>
>>
>> On Wed, 29 May 2013 03:44:33 +0200 Ross Walker <ross.rosswalker.co.uk>
>> wrote:
>>
>>> Hi All,
>>>
>>> Just an update that we will have some fixes out soon that address some
>>> errors we have been noticing with simulations crashing during NPT runs.
>>> It
>>> is possible that this is confusing the issue here as to whether the
>>> problem is related to the GTX Titan or to a possible bug in the code. I
>>> hope to have the patch released within a few days at which point it would
>>> be good to repeat these tests and then hopefully we can try to track down
>>> what is going on. I find it hard to believe that so many cards are faulty
>>> so I suspect that there may be something funky in the code with regards
>>> to
>>> GTX Titans. We'll try and get it fixed as soon as possible but for now
>>> please just wait until we get the update released for AMBER 12 in a few
>>> days and see if that helps at all.
>>>
>>> All the best
>>> Ross
>>>
>>>
>>> On 5/28/13 5:12 PM, "Divi/GMAIL" <dvenkatlu.gmail.com> wrote:
>>>
>>>> I have two TITANs in my Gigabyte workstation. I have had similar issues
>>>> with NaNs for some of the simulation setups, and I could never figure out
>>>> why the simulations failed for no apparent reason. I tried 10 and 12
>>>> Angstrom box sizes - the same random breakdowns. I thought of returning
>>>> them, suspecting memory errors, but some simulations ran perfectly fine.
>>>> I am currently running two calculations without any problems; both have
>>>> been stable for over 100 ns. I suspect the AMBER CUDA code may have some
>>>> issues under some simulation conditions such as NPT. In general, an NVT
>>>> setup is more successful than NPT, in my case.
>>>>
>>>> These are a 287,426-atom simulation on one card (9 ns/day) and a
>>>> 129,049-atom setup on the other card (20 ns/day).
>>>>
>>>> Both use the same NVT setup (AMBER12 / Intel 12.x compilers / CentOS 6.3
>>>> / driver 319.17 / CUDA 5.0).
>>>>
>>>> Input is below:
>>>> &cntrl
>>>> nstlim=500000, dt=0.002,
>>>> ntx=5, irest=1, ig=-1,
>>>> ntpr=1000, ntwr=10000, ntwx=10000,
>>>> ntt=1, tautp=2, ntb=1, ntp=0, ntc=2, ntf=2,
>>>> iwrap=1, ioutfm=1, ntxo=2,
>>>> &end
>>>>
>>>> One suggestion, if I may add: if you run short simulations of no more
>>>> than 500,000 steps (i.e. 1 ns at 2 fs), you might find some stability.
>>>> Again, there is no scientific rationale on my side, but it worked in some
>>>> cases for me.
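>>>>
>>>> (Purely as an illustration of what I mean by short segments - the file
>>>> names are placeholders, and md.in is the irest=1/ntx=5 input shown above:)
>>>>
>>>> # Chain twenty 1 ns (500,000 x 2 fs) segments through the restart files.
>>>> prev=equil.rst
>>>> for i in $(seq 1 20); do
>>>>   $AMBERHOME/bin/pmemd.cuda -O -i md.in -p prmtop -c $prev \
>>>>       -o md${i}.out -r md${i}.rst -x md${i}.nc -inf md${i}.info
>>>>   prev=md${i}.rst
>>>> done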
>>>>
>>>> This is a self-assembled system with a GIGABYTE GA-Z77X-UP7 motherboard
>>>> (Core i5 processor), a 1200 W power supply and 16 GB of memory.
>>>>
>>>>
>>>> Best regards
>>>> Divi
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Scott Le Grand
>>>> Sent: Tuesday, May 28, 2013 4:46 PM
>>>> To: AMBER Mailing List
>>>> Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
>>>> memtestG80 - UNDERclocking in Linux ?
>>>>
>>>> You can play Russian Roulette a whole bunch of rounds without blowing
>>>> your head off.
>>>>
>>>> Similarly, when you have a GPU that occasionally flips a bit the wrong
>>>> way, most of the time it will be some low-order perturbation to the
>>>> coordinates that does little more than make the trajectory
>>>> nondeterministic... Except when it doesn't...
>>>>
>>>> You can't even detect this kind of misbehavior in GROMACS, ACEMD, or
>>>> NAMD because *none* of them (to my knowledge) are capable of producing
>>>> deterministic output at production-level performance.
>>>>
>>>> Titans and 680s are consumer cards. I love them to death, but if you're
>>>> going to do production work with them, you need to qual them thoroughly
>>>> before proceeding, or you need to pay up and use Teslas instead. I'd
>>>> still build a cluster with Titans myself, but I'd ruthlessly RMA them
>>>> until I got satisfaction if they couldn't pass a test consisting of
>>>> running an AMBER simulation for 100K iterations without either crashing
>>>> or producing a nondeterministic result. The customer is always right.
>>>>
>>>>
>>>> On Tue, May 28, 2013 at 1:20 PM, Marek Maly <marek.maly.ujep.cz> wrote:
>>>>
>>>>> I would wait for the results of my GPU0, GPU1 double tests before drawing
>>>>> any serious conclusions.
>>>>>
>>>>> BTW, what exactly does "GPU is hosed" mean? Something like the GPU is
>>>>> damaged?
>>>>>
>>>>> It would also be strange (not very probable) to have bought 2 GPUs that
>>>>> are both somehow damaged (even in the same way).
>>>>>
>>>>> As I wrote, the memtestG80 tests were negative on both cards. If,
>>>>> moreover, both cards perfectly reproduce both repetitions of the Amber
>>>>> benchmarks and eventually pass some other GPU tests (can you recommend
>>>>> any besides memtestG80?), I will still believe that the GPU cards are OK
>>>>> (also thanks to the partial successes in my Amber simulations and the
>>>>> current Amber benchmarks). So maybe I will eventually try downclocking,
>>>>> but there may be other variables, e.g. driver, OS, motherboard (I will
>>>>> probably test one card in another MB just to be sure the problem is not
>>>>> MB based), etc. - that is why I asked "ET" earlier about his driver
>>>>> version; OS and MB info would also be interesting.
>>>>>
>>>>> M.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, 28 May 2013 22:13:36 +0200 Scott Le Grand
>>>>> <varelse2005.gmail.com> wrote:
>>>>>
>>>>> > Marek,
>>>>> > Your GPU is hosed. I don't have anything else to add. I'm not going to
>>>>> > go snark hunting for a bug that doesn't exist.
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Tue, May 28, 2013 at 12:24 PM, Marek Maly <marek.maly.ujep.cz> wrote:
>>>>> >
>>>>> >> Hi, just out of curiosity: which driver are you using on that machine
>>>>> >> where the OC TITAN works perfectly - 319.17, or something more recent,
>>>>> >> e.g. 319.23?
>>>>> >>
>>>>> >> RMA is a good idea, but it could also be a long story, and to succeed
>>>>> >> you need strong arguments, especially if you are going to RMA two OC
>>>>> >> TITANs.
>>>>> >>
>>>>> >> I am not sure the argument "the cards have problems with some Amber
>>>>> >> calculations" would be strong enough here. It would be much better to
>>>>> >> have clear results from respected GPU tests, and as it seems, you can
>>>>> >> run extensive GPU tests with multiple routines without any errors and
>>>>> >> still have problems with particular Amber simulations...
>>>>> >>
>>>>> >> BTW, I am now running the Amber benchmarks with nstlim=100K and
>>>>> >> ig=default twice for each card. The tests will be done in about 3 hours
>>>>> >> (due to the slow nucleosome GB test).
>>>>> >>
>>>>> >> But even now I have interesting results from the first test on GPU0
>>>>> >> (nucleosome is still running); see below.
>>>>> >>
>>>>> >> As you can see, JAC_NPT crashed around step 11000; here is the last
>>>>> >> mdout record:
>>>>> >>
>>>>> >> *********
>>>>> >>
>>>>> >>
>>>>>
>>>>> >> ----------------------------------------------------------------------------
>>>>> >>
>>>>> >> check COM velocity, temp: 0.000021 0.00(Removed)
>>>>> >>
>>>>> >>  NSTEP =    11000   TIME(PS) =      28.000  TEMP(K) =   300.39  PRESS =    -9.4
>>>>> >>  Etot   =    -58092.8958  EKtot   =     14440.2520  EPtot      =    -72533.1478
>>>>> >>  BOND   =       443.3912  ANGLE   =      1253.5177  DIHED      =       970.1275
>>>>> >>  1-4 NB =       567.2497  1-4 EEL =      6586.9007  VDWAALS    =      8664.9960
>>>>> >>  EELEC  =    -91019.3306  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>>>>> >>  EKCMT  =      6274.0354  VIRIAL  =      6321.9969  VOLUME     =    236141.9494
>>>>> >>                                                     Density    =         1.0162
>>>>> >>
>>>>> >>
>>>>>
>>>>> >> ----------------------------------------------------------------------------
>>>>> >>
>>>>> >> | ERROR: max pairlist cutoff must be less than unit cell max sphere radius!
>>>>> >>
>>>>> >> ********
>>>>> >>
>>>>> >> Any idea about that ERROR ?
>>>>> >>
>>>>> >> On the other hand, FACTOR_IX_NPT, which has far more atoms, passed
>>>>> >> without any issue.
>>>>> >>
>>>>> >> Cellulose crashed right at the beginning without any ERROR message in
>>>>> >> the mdout file.
>>>>> >>
>>>>> >>
>>>>> >> I am very curious about the exact reproducibility of the results, at
>>>>> >> least between the two tests on each individual card.
>>>>> >>
>>>>> >> BTW, regarding possible downclocking, does anyone know of an NVclock
>>>>> >> alternative, or will I really be forced to edit the frequency value in
>>>>> >> the GPU BIOS?
>>>>> >>
>>>>> >> Best,
>>>>> >>
>>>>> >> Marek
>>>>> >>
>>>>> >> HERE ARE THE FIRST DATA FROM MY 2x2 Bench tests
>>>>> >>
>>>>> >> JAC_PRODUCTION_NVE - 23,558 atoms PME
>>>>> >> -------------------------------------
>>>>> >> 1 x GTX_TITAN: | ns/day = 115.91   seconds/ns = 745.39
>>>>> >>
>>>>> >> JAC_PRODUCTION_NPT - 23,558 atoms PME
>>>>> >> -------------------------------------
>>>>> >> 1 x GTX_TITAN: STOP PMEMD Terminated Abnormally!
>>>>> >>                | ns/day = 90.72    seconds/ns = 952.42
>>>>> >>
>>>>> >> FACTOR_IX_PRODUCTION_NVE - 90,906 atoms PME
>>>>> >> -------------------------------------------
>>>>> >> 1 x GTX_TITAN: | ns/day = 30.56    seconds/ns = 2827.33
>>>>> >>
>>>>> >> FACTOR_IX_PRODUCTION_NPT - 90,906 atoms PME
>>>>> >> -------------------------------------------
>>>>> >> 1 x GTX_TITAN: | ns/day = 25.01    seconds/ns = 3454.56
>>>>> >>
>>>>> >> CELLULOSE_PRODUCTION_NVE - 408,609 atoms PME
>>>>> >> --------------------------------------------
>>>>> >> 1 x GTX_TITAN: Error: unspecified launch failure launching kernel kNLSkinTest
>>>>> >>                cudaFree GpuBuffer::Deallocate failed unspecified launch failure
>>>>> >>                grep: mdinfo.1GTX_TITAN: No such file or directory
>>>>> >>
>>>>> >> TRPCAGE_PRODUCTION - 304 atoms GB
>>>>> >> ---------------------------------
>>>>> >> 1 x GTX_TITAN: | ns/day = 595.09   seconds/ns = 145.19
>>>>> >>
>>>>> >> MYOGLOBIN_PRODUCTION - 2,492 atoms GB
>>>>> >> -------------------------------------
>>>>> >> 1 x GTX_TITAN: | ns/day = 202.56   seconds/ns = 426.53
>>>>> >>
>>>>> >> NUCLEOSOME_PRODUCTION - 25,095 atoms GB
>>>>> >> ---------------------------------------
>>>>> >> 1 x GTX_TITAN:
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On Tue, 28 May 2013 20:42:32 +0200 ET <sketchfoot.gmail.com> wrote:
>>>>> >>
>>>>> >> > Hi,
>>>>> >> >
>>>>> >> > I just got a superclocked Titan and one at the normal frequency. The
>>>>> >> > first one ran like a charm with no issues so far. The other,
>>>>> >> > standard-clocked one could never get past the constant pressure stage
>>>>> >> > in an NPT simulation. It kept writing NaN or ********* in the outfile.
>>>>> >> > I swapped them about in the PCIe lanes, then ran it solo in each one
>>>>> >> > of the lanes. Despite all this it was still failing the benchmark that
>>>>> >> > the other one had no problems with.
>>>>> >> >
>>>>> >> > I couldn't find any memory errors with GPU-burn either, but as they
>>>>> >> > cost near a grand a piece, I RMA'd it today. I recommend you do the
>>>>> >> > same if it's not giving you any joy. Life's too short. :)
>>>>> >> >
>>>>> >> > br,
>>>>> >> > g
>>>>> >> >
>>>>> >> >
>>>>> >> > On 28 May 2013 16:57, Scott Le Grand <varelse2005.gmail.com> wrote:
>>>>> >> >
>>>>> >> >> AMBER != NAMD...
>>>>> >> >>
>>>>> >> >> GTX 680 != GTX Titan...
>>>>> >> >>
>>>>> >> >> Ian's suggestion is a good one. But even then, you need to test your
>>>>> >> >> GPUs, as the Titans are running right on the edge of stability. Like I
>>>>> >> >> told Marek, try running 100K iterations of Cellulose NVE twice with
>>>>> >> >> the same random seed. If you don't get identical, bit-accurate output,
>>>>> >> >> your GPU is not working. Memtest programs do not catch this because
>>>>> >> >> (I am guessing) they are designed for a uniform memory hierarchy with
>>>>> >> >> only one path to read and write data. I have a stock GTX Titan that
>>>>> >> >> cannot pass the Cellulose NVE test and another one that does. I spent
>>>>> >> >> a couple of days on the former GPU looking for the imaginary bug that
>>>>> >> >> went away like magic the second I switched out the GPU.
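>>>>> >> >>
>>>>> >> >> (A minimal sketch of such a qual run - the file names are placeholders,
>>>>> >> >> and only the grep'd energy lines are compared, since the timing lines
>>>>> >> >> legitimately differ between runs:)
>>>>> >> >>
>>>>> >> >> $AMBERHOME/bin/pmemd.cuda -O -i mdin -p prmtop -c inpcrd -o mdout.run1
>>>>> >> >> $AMBERHOME/bin/pmemd.cuda -O -i mdin -p prmtop -c inpcrd -o mdout.run2
>>>>> >> >> grep "Etot" mdout.run1 > e1.txt
>>>>> >> >> grep "Etot" mdout.run2 > e2.txt
>>>>> >> >> cmp e1.txt e2.txt && echo "bit-identical" || echo "FAIL: nondeterministic"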
>>>>> >> >>
>>>>> >> >> Scott
>>>>> >> >>
>>>>> >> >>
>>>>> >> >>
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> On Tue, May 28, 2013 at 8:11 AM, Robert Konecny <rok.ucsd.edu> wrote:
>>>>> >> >>
>>>>> >> >> > Hi Scott,
>>>>> >> >> >
>>>>> >> >> > unfortunately we are seeing similar Amber instability on GTX Titans
>>>>> >> >> > as Marek is. We have a box with four GTX Titans (not overclocked)
>>>>> >> >> > running CentOS 6.3 with the NVidia 319.17 driver and Amber 12.2. Any
>>>>> >> >> > Amber simulation longer than 10-15 min eventually crashes on these
>>>>> >> >> > cards, including both JAC benchmarks (with extended run time). This
>>>>> >> >> > is reproducible on all four cards.
>>>>> >> >> >
>>>>> >> >> > To eliminate a possible hardware error we ran extended GPU memory
>>>>> >> >> > tests on all four Titans with memtestG80, cuda_memtest and also
>>>>> >> >> > gpu_burn - all finished without errors. Since I agree that these
>>>>> >> >> > programs may not test the GPU completely, we also set up simulations
>>>>> >> >> > with NAMD. We can run four NAMD simulations simultaneously for many
>>>>> >> >> > days without any errors on this hardware. For reference - we also
>>>>> >> >> > have exactly the same server with the same hardware components but
>>>>> >> >> > with four GTX680s, and this setup works just fine for Amber. So all
>>>>> >> >> > this leads me to believe that a hardware error is not very likely.
>>>>> >> >> >
>>>>> >> >> > I would appreciate your comments on this; perhaps there is something
>>>>> >> >> > else causing these errors which we are not seeing.
>>>>> >> >> >
>>>>> >> >> > Thanks,
>>>>> >> >> >
>>>>> >> >> > Robert
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> > On Mon, May 27, 2013 at 04:25:24PM -0700, Scott Le Grand wrote:
>>>>> >> >> > > I have two GTX Titans. One is defective, the other is not.
>>>>> >> >> > > Unfortunately, they both pass all standard GPU memory tests.
>>>>> >> >> > >
>>>>> >> >> > > What the defective one doesn't do is generate reproducibly
>>>>> >> >> > > bit-accurate outputs for simulations of Factor IX (90,986 atoms)
>>>>> >> >> > > or larger, of 100K or so iterations.
>>>>> >> >> > >
>>>>> >> >> > > Which is yet another reason why I insist on MD algorithms
>>>>> >> >> > > (especially on GPUs) being deterministic. Besides its ability to
>>>>> >> >> > > find software bugs, and fulfilling one of the most important
>>>>> >> >> > > tenets of science, it's a great way to diagnose defective hardware
>>>>> >> >> > > with very little effort.
>>>>> >> >> > >
>>>>> >> >> > > 928 MHz? That's 6% above the boost clock of a stock Titan. Titan
>>>>> >> >> > > is pushing the performance envelope as is. If you're going to pay
>>>>> >> >> > > the premium for such chips, I'd send them back until you get one
>>>>> >> >> > > that runs correctly. I'm very curious how fast you can push one of
>>>>> >> >> > > these things before they give out.
>>>>> >> >> > >
>>>>> >> >> > >
>>>>> >> >> > >
>>>>> >> >> > >
>>>>> >> >> > >
>>>>> >> >> > >
>>>>> >> >> > >
>>>>> >> >> > > On Mon, May 27, 2013 at 10:01 AM, Marek Maly <marek.maly.ujep.cz> wrote:
>>>>> >> >> > >
>>>>> >> >> > > > Dear all,
>>>>> >> >> > > >
>>>>> >> >> > > > I have recently bought two "EVGA GTX TITAN Superclocked" GPUs.
>>>>> >> >> > > >
>>>>> >> >> > > > I did the first calculations (pmemd.cuda in Amber12) with systems
>>>>> >> >> > > > of around 60K atoms without any problems (NPT, Langevin), but when
>>>>> >> >> > > > I later tried bigger systems (around 100K atoms) I obtained the
>>>>> >> >> > > > "classical" irritating errors
>>>>> >> >> > > >
>>>>> >> >> > > > cudaMemcpy GpuBuffer::Download failed unspecified launch failure
>>>>> >> >> > > >
>>>>> >> >> > > > just after a few thousand MD steps.
>>>>> >> >> > > >
>>>>> >> >> > > > So this was obviously the reason for the memtestG80 tests
>>>>> >> >> > > > ( https://simtk.org/home/memtest ).
>>>>> >> >> > > >
>>>>> >> >> > > > So I compiled memtestG80 from sources ( memtestG80-1.1-src.tar.gz )
>>>>> >> >> > > > and then tested just a small part of the GPU memory (200 MB) using
>>>>> >> >> > > > 100 iterations.
>>>>> >> >> > > >
>>>>> >> >> > > > On both cards I obtained a huge number of errors, but "just" on
>>>>> >> >> > > > "Random blocks:" - 0 errors in all the remaining tests in all
>>>>> >> >> > > > iterations.
>>>>> >> >> > > >
>>>>> >> >> > > > ------THE LAST ITERATION AND FINAL RESULTS-------
>>>>> >> >> > > >
>>>>> >> >> > > > Test iteration 100 (GPU 0, 200 MiB): 169736847 errors so far
>>>>> >> >> > > > Moving Inversions (ones and zeros): 0 errors (6 ms)
>>>>> >> >> > > > Memtest86 Walking 8-bit: 0 errors (53 ms)
>>>>> >> >> > > > True Walking zeros (8-bit): 0 errors (26 ms)
>>>>> >> >> > > > True Walking ones (8-bit): 0 errors (26 ms)
>>>>> >> >> > > > Moving Inversions (random): 0 errors (6 ms)
>>>>> >> >> > > > Memtest86 Walking zeros (32-bit): 0 errors (105 ms)
>>>>> >> >> > > > Memtest86 Walking ones (32-bit): 0 errors (104 ms)
>>>>> >> >> > > > Random blocks: 1369863 errors (27 ms)
>>>>> >> >> > > > Memtest86 Modulo-20: 0 errors (215 ms)
>>>>> >> >> > > > Logic (one iteration): 0 errors (4 ms)
>>>>> >> >> > > > Logic (4 iterations): 0 errors (8 ms)
>>>>> >> >> > > > Logic (shared memory, one iteration): 0 errors (8 ms)
>>>>> >> >> > > > Logic (shared-memory, 4 iterations): 0 errors (25 ms)
>>>>> >> >> > > >
>>>>> >> >> > > > Final error count after 100 iterations over 200 MiB of GPU memory:
>>>>> >> >> > > > 171106710 errors
>>>>> >> >> > > >
>>>>> >> >> > > > ------------------------------------------
>>>>> >> >> > > >
>>>>> >> >> > > > I have some questions and would be really grateful for any comments.
>>>>> >> >> > > >
>>>>> >> >> > > > Regarding overclocking: using deviceQuery I found out that under
>>>>> >> >> > > > Linux both cards automatically run at the boost shader/GPU
>>>>> >> >> > > > frequency, which here is 928 MHz (the base value for these
>>>>> >> >> > > > factory-OC cards is 876 MHz). deviceQuery reported a Memory Clock
>>>>> >> >> > > > rate of 3004 MHz although "it" should be 6008 MHz, but maybe the
>>>>> >> >> > > > quantity reported by deviceQuery as "Memory Clock rate" is
>>>>> >> >> > > > different from the product specification "Memory Clock". It seems
>>>>> >> >> > > > that "Memory Clock rate" = "Memory Clock"/2. Am I right? Or is
>>>>> >> >> > > > deviceQuery just not able to read this spec properly on a Titan GPU?
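>>>>> >> >> > > >
>>>>> >> >> > > > (For cross-checking the reported clocks, something like the commands
>>>>> >> >> > > > below can be used - the deviceQuery path depends on where the CUDA
>>>>> >> >> > > > samples were built, and on GeForce cards nvidia-smi may show some
>>>>> >> >> > > > clock fields as N/A. As far as I understand, the factor of two is
>>>>> >> >> > > > just the GDDR5 double-data-rate convention: the spec sheet quotes
>>>>> >> >> > > > the effective 6008 MHz data rate, while the driver reports 3004 MHz.)
>>>>> >> >> > > >
>>>>> >> >> > > > ~/NVIDIA_CUDA-5.0_Samples/1_Utilities/deviceQuery/deviceQuery | grep -i clock
>>>>> >> >> > > > nvidia-smi -q -d CLOCK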
>>>>> >> >> > > >
>>>>> >> >> > > > Anyway, for the moment I assume that the problem might be due to
>>>>> >> >> > > > the high shader/GPU frequency
>>>>> >> >> > > > (see here: http://folding.stanford.edu/English/DownloadUtils ).
>>>>> >> >> > > >
>>>>> >> >> > > > To verify this hypothesis one should perhaps UNDERclock to the base
>>>>> >> >> > > > frequency, which for this model is 876 MHz, or even to the TITAN
>>>>> >> >> > > > REFERENCE frequency, which is 837 MHz.
>>>>> >> >> > > >
>>>>> >> >> > > > Obviously I am working with these cards under Linux (CentOS
>>>>> >> >> > > > 2.6.32-358.6.1.el6.x86_64) and, as I found, the OC tools under
>>>>> >> >> > > > Linux are in fact limited to the NVclock utility, which is
>>>>> >> >> > > > unfortunately out of date (at least as far as the GTX Titan is
>>>>> >> >> > > > concerned). I got the following message when I just wanted NVclock
>>>>> >> >> > > > to read and print the shader and memory frequencies of my Titans:
>>>>> >> >> > > >
>>>>> >> >> > > >
>>>>> >> >> > > > ---------------------------------------------------------------------
>>>>> >> >> > > >
>>>>> >> >> > > > [root.dyn-138-272 NVCLOCK]# nvclock -s --speeds
>>>>> >> >> > > > Card: Unknown Nvidia card
>>>>> >> >> > > > Card number: 1
>>>>> >> >> > > > Memory clock: -2147483.750 MHz
>>>>> >> >> > > > GPU clock: -2147483.750 MHz
>>>>> >> >> > > >
>>>>> >> >> > > > Card: Unknown Nvidia card
>>>>> >> >> > > > Card number: 2
>>>>> >> >> > > > Memory clock: -2147483.750 MHz
>>>>> >> >> > > > GPU clock: -2147483.750 MHz
>>>>> >> >> > > >
>>>>> >> >> > > >
>>>>> >> >> > > >
>>>>> >> >> > > > ---------------------------------------------------------------------
>>>>> >> >> > > >
>>>>> >> >> > > >
>>>>> >> >> > > > I would be really grateful for tips on "NVclock alternatives", but
>>>>> >> >> > > > after wasting some hours on googling it seems that there is no
>>>>> >> >> > > > other Linux tool with NVclock's functionality. So the only
>>>>> >> >> > > > possibility here is perhaps to edit the GPU BIOS with Lin/DOS/Win
>>>>> >> >> > > > tools (Kepler BIOS Tweaker, NVflash), but I would rather avoid such
>>>>> >> >> > > > an approach, as it probably also voids the warranty, even though I
>>>>> >> >> > > > intend to underclock the GPUs, not overclock them.
>>>>> >> >> > > > So before this eventual step (GPU BIOS editing) I would like to
>>>>> >> >> > > > have at least a rough estimate of the probability that the problems
>>>>> >> >> > > > here really are caused by the overclocking (too high a default
>>>>> >> >> > > > boost shader frequency).
>>>>> >> >> > > >
>>>>> >> >> > > > I hope to estimate this probability from the responses of other
>>>>> >> >> > > > Amber/Titan SC users, if I am not the only crazy guy who bought
>>>>> >> >> > > > this model for Amber calculations :)) But of course any experiences
>>>>> >> >> > > > with Titan cards related to their memtestG80 results and
>>>>> >> >> > > > UNDER/OVERclocking (if possible under Linux) are welcome as well!
>>>>> >> >> > > >
>>>>> >> >> > > > My HW/SW configuration
>>>>> >> >> > > >
>>>>> >> >> > > > motherboard: ASUS P9X79 PRO
>>>>> >> >> > > > CPU: Intel Core i7-3930K
>>>>> >> >> > > > RAM: CRUCIAL Ballistix Sport 32GB (4x8GB) DDR3 1600 VLP
>>>>> >> >> > > > CASE: CoolerMaster Dominator CM-690 II Advanced,
>>>>> >> >> > > > Power:Enermax PLATIMAX EPM1200EWT 1200W, 80+, Platinum
>>>>> >> >> > > > GPUs : 2 x EVGA GTX TITAN Superclocked 6GB
>>>>> >> >> > > > cooler: Cooler Master Hyper 412 SLIM
>>>>> >> >> > > >
>>>>> >> >> > > > OS: CentOS (2.6.32-358.6.1.el6.x86_64)
>>>>> >> >> > > > driver version: 319.17
>>>>> >> >> > > > cudatoolkit_5.0.35_linux_64_rhel6.x
>>>>> >> >> > > >
>>>>> >> >> > > > The computer is in an air-conditioned room with a constant ambient
>>>>> >> >> > > > temperature of around 18°C.
>>>>> >> >> > > >
>>>>> >> >> > > >
>>>>> >> >> > > > Thanks a lot in advance for any comment/experience !
>>>>> >> >> > > >
>>>>> >> >> > > > Best wishes,
>>>>> >> >> > > >
>>>>> >> >> > > > Marek
>>>>> >> >> > > >
>>>>> >> >> > > > --
>>>>> >> >> > > > This message was created with Opera's revolutionary e-mail client:
>>>>> >> >> > > > http://www.opera.com/mail/
>>>>> >> >> > > >
>>>>> >> >
>>>>> >> >
>>>>> >> >
>>>>> >> >
>>>>> >>
>>>>> >>
>>>>> >> --
>>>>> >> This message was created with Opera's revolutionary e-mail client:
>>>>> >> http://www.opera.com/mail/
>>>>> >>
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>> --
>>>>> This message was created with Opera's revolutionary e-mail client:
>>>>> http://www.opera.com/mail/
>>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>> --
>> This message was created with Opera's revolutionary e-mail client:
>> http://www.opera.com/mail/
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber
>>
>>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed May 29 2013 - 14:00:04 PDT