Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Marek Maly <marek.maly.ujep.cz>
Date: Wed, 29 May 2013 23:22:34 +0200

Hi Scott,

What do you mean by "try running for 100k steps before comparing energies"?
In all the tests I have done so far I ran exactly 100k steps before comparing
the total energies (E_tot at step 100 000). Do you mean I should extend the
tests to 200k steps now?

   M.


On Wed, 29 May 2013 22:46:58 +0200 Scott Le Grand <varelse2005.gmail.com>
wrote:

> PS: try running for 100k steps before comparing energies and I suspect no
> two simulations will match.
> On May 29, 2013 1:41 PM, "Scott Le Grand" <varelse2005.gmail.com> wrote:
>
>> Your Titan setup is hosed. Your results were not 100% deterministic for
>> the same inputs.
>>
>> Energies + Forces use a different subroutine than just Forces, hence the
>> ntpr dependence. Hence changing ntpr is effectively changing the input.
>>
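To illustrate the floating-point side of this (a toy Python sketch, not the
pmemd.cuda code): accumulating the same per-interaction terms along two
different code paths rounds differently, so the totals disagree slightly,
yet each path on its own is fully deterministic.

import random

# Stand-in for per-interaction energy contributions from one MD step.
random.seed(0)
terms = [random.uniform(-1e3, 1e3) for _ in range(100000)]

def sum_forward(xs):
    """Accumulate in one order (think: forces-only kernel)."""
    total = 0.0
    for x in xs:
        total += x
    return total

def sum_reversed(xs):
    """Accumulate in another order (think: energy+forces kernel)."""
    total = 0.0
    for x in reversed(xs):
        total += x
    return total

a = sum_forward(terms)
b = sum_reversed(terms)
print(a, b)                      # typically differ in the last digits
print(a == b)                    # usually False: same input, different rounding
print(sum_forward(terms) == a)   # True: each path reproduces itself exactly
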
>> It's 100% ironclad reproducibility that matters and you demonstrated
>> it's
>> not happening.
>> On May 29, 2013 1:30 PM, "Marek Maly" <marek.maly.ujep.cz> wrote:
>>
>>> Hi all,
>>>
>>> First of all, thanks to Ross for his update! Although it is still a
>>> question whether it will solve all the reported Amber issues with Titan/OC
>>> Titan GPUs. So let's see and hope :))
>>>
>>> Here are my results - see the attached TXT file with the tables in which
>>> the test results are summarised. I ran the same Amber benchmark tests twice
>>> on each GPU (both Titans, GTX 680 and GTX 580) to check the reproducibility
>>> of the results after 100K steps with ig at its default (i.e. ig not present
>>> in the mdin file).
>>>
>>> The first table contains the ns/day estimates obtained for each molecular
>>> system on each TITAN GPU. Interestingly, the estimates obtained for the
>>> same system in different rounds differ slightly, but maybe that's OK.
>>>
>>> The second table lists the total energy after 100k steps, to check the
>>> reproducibility of the results.
>>>
>>> Here is a summary:
>>>
>>> #1 - simulation crashes on TITANs
>>>
>>> Interestingly, there was just one crash in JAC_NPT (TITAN_0, ROUND_1); the
>>> remaining 3 TITAN JAC_NPT simulations finished. There were also 3 crashes
>>> in the CELLULOSE_NVE test, but the last simulation (TITAN_1, ROUND_2)
>>> finished without any problem. All the remaining simulations always
>>> finished without any problem. So the crashes seem to be
>>> non-reproducible/unpredictable on some molecular systems/(mdin setups).
>>>
>>> CRASH ERRORS:
>>>
>>> a) JAC_NPT (TITAN_0, ROUND_1)
>>> Here 11k steps completed successfully before the crash; I found this error
>>> in the mdout file:
>>>
>>> | ERROR: max pairlist cutoff must be less than unit cell max sphere radius!
>>>
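As far as I understand it, the check behind that message compares the
pairlist cutoff (cut + skinnb) against the radius of the largest sphere that
fits inside the unit cell, so a box that shrinks too much during NPT can trip
it. A rough Python sketch of the geometry, assuming an orthorhombic box (the
numbers below are made up, not taken from the JAC files):

# Rough geometric check behind the pairlist error (orthorhombic box assumed).
cut = 8.0                   # direct-space cutoff from mdin ('cut'), Angstroms
skinnb = 2.0                # nonbonded skin added on top of the cutoff
box = (61.0, 61.0, 20.0)    # box edge lengths, Angstroms (made-up example)

# For an orthorhombic cell the largest inscribed sphere has a radius of half
# the smallest box edge.
max_sphere_radius = min(box) / 2.0
pairlist_cutoff = cut + skinnb

if pairlist_cutoff >= max_sphere_radius:
    print("ERROR: max pairlist cutoff must be less than unit cell max sphere radius!")
else:
    print("box is large enough: cut + skinnb = %.1f A < %.1f A"
          % (pairlist_cutoff, max_sphere_radius))
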
>>> b) CELLULOSE_NVE (TITAN_0, ROUND_1, ROUND_2; TITAN_1, ROUND_1)
>>> Here I did not find any error in the mdout file. In all three cases, only
>>> this error was written to standard output (screen/nohup.out file):
>>>
>>> ------
>>> Error: unspecified launch failure launching kernel kNLSkinTest
>>> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
>>> grep: mdinfo.1GTX_TITAN: No such file or directory
>>> -----
>>>
>>> For the CELLULOSE_NVE case I then started to play with the NTPR parameter
>>> (originally just on the TITAN-0 GPU) to see how many steps completed
>>> before the crash, and this little investigation turned out to be more
>>> interesting than I expected :)) Below are, in chronological order, my
>>> results for E_tot after 2000 steps on the different GPUs (machines) - I
>>> repeated the calculation several times for a given NTPR just to be sure.
>>>
>>> TITAN-0, Etot after 2000 steps
>>>
>>> NTPR=10
>>>
>>> -443256.6867
>>> -443256.6867
>>> -443256.6867
>>>
>>> NTPR=100
>>>
>>> -443250.1350
>>> -443250.1350
>>> -443250.1350
>>>
>>> NTPR=200
>>>
>>> -443261.0705
>>> -443261.0705
>>> -443072.3097
>>> -443261.0705
>>> -443261.0705
>>> -443261.0705
>>> -443261.0705
>>>
>>> NTPR=10 (again just to verify)
>>>
>>> -443256.6867
>>> -443256.6867
>>>
>>>
>>> Then I tried with TITAN-1
>>>
>>> NTPR=10
>>>
>>> -443256.6867
>>> -443256.6867
>>>
>>> NTPR=100
>>>
>>> -443250.1350
>>> -443250.1350
>>>
>>> NTPR=200
>>>
>>> -443261.0705
>>> -443261.0705
>>>
>>>
>>> Then I tried with GTX-580
>>>
>>> NTPR=10
>>>
>>> -443256.6867
>>> -443256.6867
>>>
>>> NTPR=200
>>>
>>> -443261.0705
>>> -443261.0705
>>>
>>> Then I tried with GTX-680
>>>
>>> NTPR=10 Etot after 2000 steps
>>>
>>> -443256.6711
>>> -443256.6711
>>>
>>> NTPR=200 Etot after 2000 steps
>>>
>>> -443261.0705
>>> -443261.0705
>>>
>>> Any idea why the energies should depend on the frequency of the energy
>>> records (NTPR)?
>>>
>>>
>>>
>>> #2 - reproducibility on TITANs (see attached table.txt)
>>>
>>> Here, too, the differences depend on the particular systems/setups. For
>>> the FACTOR_IX_NVE, FACTOR_IX_NPT, TRPCAGE and MYOGLOBIN systems I obtained
>>> 100% reproducibility (the results for a given system were identical for
>>> both cards/all ROUNDs). For the JAC_NVE, JAC_NPT and NUCLEOSOME systems I
>>> generally obtained small differences, although on the TITAN_1 GPU the
>>> NUCLEOSOME results were also 100% reproducible. Moreover, for the TITAN_1
>>> card, which managed to finish the CELLULOSE test at least in ROUND_2, I
>>> ran a 3rd additional round and got a result identical to ROUND_2
>>> (i.e. -443246.3206). So regarding the TITAN_1 GPU I can say that it
>>> reproduces the 100k-step CELLULOSE_NVE result 100%, at least on the runs
>>> that finish successfully :))
>>>
>>>
>>> #3 - GTX-580, GTX-680 controls
>>>
>>> Here the simulations ran without any problems and were 100% reproducible
>>> on each card. However, the results for a given system differ slightly
>>> between the two cards, with the exception of the CELLULOSE system, where
>>> both the GTX-580 and GTX-680 gave an identical result which is, moreover,
>>> nearly identical to the result obtained with TITAN_1 during ROUND_2
>>> (relative difference 2e-6).
>>>
>>>
>>> TO ET:
>>> a)
>>> I had no problems with the minimization stages in my own simulations of
>>> systems bigger than 100k atoms, which crashed during the NVT heating phase.
>>>
>>> b)
>>> The 313.30 driver??? OK, so after 319.23 I will try experimenting with
>>> this a bit "outdated" version :)) Actually I am working with 319.17
>>> (and CUDA 5.0).
>>>
>>> c)
>>> Could you please run at least the JAC_NPT, JAC_NVE, NUCLEOSOME and
>>> CELLULOSE_NVE tests with 100 000 steps (same random seed, e.g. the default
>>> = ig deleted from mdin if it is there) twice, to confirm 100%
>>> reproducibility on your TITAN GPU?
>>>
>>> TO Divi:
>>>
>>> Dividing the whole simulation into many subtrajectories (in my case
>>> 0.5 ns = 250k steps of 2 fs) is also my usual approach, but it does not
>>> seem to help here by itself. Could you please also run the same tests
>>> that I asked ET for (point c))?
>>>
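For the comparison itself, something like this small Python helper is what I
have in mind - it pulls Etot at a given NSTEP out of two mdout files and
checks that the printed values match digit for digit (it assumes the usual
pmemd mdout layout; the file names below are only examples):

import re
import sys

def etot_at_step(mdout_path, step):
    """Return the Etot string printed for the energy block at NSTEP == step."""
    nstep_re = re.compile(r"NSTEP\s*=\s*(\d+)")
    etot_re = re.compile(r"Etot\s*=\s*(-?\d+\.\d+)")
    current_step = None
    with open(mdout_path) as fh:
        for line in fh:
            m = nstep_re.search(line)
            if m:
                current_step = int(m.group(1))
                continue
            m = etot_re.search(line)
            if m and current_step == step:
                return m.group(1)   # keep as text for an exact comparison
    return None

if __name__ == "__main__":
    # usage: python compare_etot.py mdout.ROUND_1 mdout.ROUND_2 100000
    f1, f2, step = sys.argv[1], sys.argv[2], int(sys.argv[3])
    e1, e2 = etot_at_step(f1, step), etot_at_step(f2, step)
    print(f1, e1)
    print(f2, e2)
    print("bitwise reproducible:", e1 is not None and e1 == e2)
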
>>>
>>> BTW, the CUDA 5.5 release candidate was just released
>>> (https://developer.nvidia.com/cuda-toolkit); would it be a reasonable idea
>>> to try compiling/running pmemd.cuda with this brand new CUDA version?
>>>
>>> Thanks !
>>>
>>> Best wishes,
>>>
>>> Marek
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, 29 May 2013 03:44:33 +0200 Ross Walker <ross.rosswalker.co.uk>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Just an update that we will have some fixes out soon that address some
>>>> errors we have been noticing with simulations crashing during NPT runs. It
>>>> is possible that this is confusing the issue here as to whether the
>>>> problem is related to the GTX Titan or to a possible bug in the code. I
>>>> hope to have the patch released within a few days at which point it would
>>>> be good to repeat these tests and then hopefully we can try to track down
>>>> what is going on. I find it hard to believe that so many cards are faulty
>>>> so I suspect that there may be something funky in the code with regards to
>>>> GTX Titans. We'll try and get it fixed as soon as possible but for now
>>>> please just wait until we get the update released for AMBER 12 in a few
>>>> days and see if that helps at all.
>>>>
>>>> All the best
>>>> Ross
>>>>
>>>>
>>>> On 5/28/13 5:12 PM, "Divi/GMAIL" <dvenkatlu.gmail.com> wrote:
>>>>
>>>>> I have two TITANs in my Gigabyte workstation. I have had similar issues
>>>>> with NaNs for some of the simulation setups, and could never figure out
>>>>> why the simulations failed for no apparent reason. I tried 10 and 12
>>>>> Angstrom box sizes - same random breakdowns. I thought of returning them,
>>>>> suspecting memory errors, but some simulations ran perfectly fine. I am
>>>>> currently running two calculations without any problems; both have been
>>>>> stable for over 100 ns. I suspect the AMBER CUDA code may have some
>>>>> issues under some simulation conditions such as NPT. In general, an NVT
>>>>> setup is more successful than NPT, in my case.
>>>>>
>>>>> One card is running a 287,426-atom simulation (9 ns/day);
>>>>> the other card a 129,049-atom setup (20 ns/day).
>>>>>
>>>>> Both use the same NVT setup (AMBER12 / Intel 12.x compilers / CentOS 6.3 /
>>>>> driver 319.17 / CUDA 5.0).
>>>>>
>>>>> Input is below:
>>>>> &cntrl
>>>>> nstlim=500000, dt=0.002,
>>>>> ntx=5, irest=1, ig=-1,
>>>>> ntpr=1000, ntwr=10000, ntwx=10000,
>>>>> ntt=1, tautp=2, ntb=1, ntp=0, ntc=2, ntf=2,
>>>>> iwrap=1, ioutfm=1, ntxo=2,
>>>>> &end
>>>>>
>>>>> One suggestion, if I may add: if you run short simulations of no more
>>>>> than 500,000 steps (or 1 ns at 2 fs), you might find some stability.
>>>>> Again, no scientific rationale on my side, but it worked in some cases
>>>>> for me.
>>>>>
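Just to make the chunking concrete, this is roughly how I picture chaining
such short segments (a sketch only; the pmemd.cuda flags are the standard
-O/-i/-o/-p/-c/-r/-x ones, and the file names are placeholders):

import subprocess

# Run a long trajectory as a chain of short restart-to-restart segments.
# 'md.in' would contain nstlim=500000, irest=1, ntx=5, etc.
prmtop = "system.prmtop"
n_segments = 10
prev_restart = "equil.rst"          # restart from equilibration

for i in range(1, n_segments + 1):
    restart = "md_%03d.rst" % i
    cmd = [
        "pmemd.cuda", "-O",
        "-i", "md.in",
        "-o", "md_%03d.out" % i,
        "-p", prmtop,
        "-c", prev_restart,         # start from the previous segment's restart
        "-r", restart,
        "-x", "md_%03d.nc" % i,
    ]
    subprocess.run(cmd, check=True) # stop the chain if a segment crashes
    prev_restart = restart
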
>>>>> This is a self-assembled system with a GIGABYTE GA-Z77X-UP7 motherboard
>>>>> (with a Core i5 processor), a 1200 W power supply and 16 GB of memory.
>>>>>
>>>>>
>>>>> Best regards
>>>>> Divi
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Scott Le Grand
>>>>> Sent: Tuesday, May 28, 2013 4:46 PM
>>>>> To: AMBER Mailing List
>>>>> Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
>>>>> memtestG80 - UNDERclocking in Linux ?
>>>>>
>>>>> You can play Russian Roulette a whole bunch of rounds without blowing
>>>>> your head off.
>>>>>
>>>>> Similarly, when you have a GPU that occasionally flips a bit the wrong
>>>>> way, most of the time it will be some low-order perturbation to the
>>>>> coordinates that does little more than make the trajectory
>>>>> nondeterministic... Except when it doesn't...
>>>>>
>>>>> You can't even detect this kind of misbehavior in GROMACS, ACEMD, or NAMD
>>>>> because *none* of them (to my knowledge) are capable of producing
>>>>> deterministic output at production-level performance.
>>>>>
>>>>> Titans and 680s are consumer cards. I love them to death, but if you're
>>>>> going to do production work with them, you need to qual them thoroughly
>>>>> before proceeding or you need to pay up and use Teslas instead. I'd still
>>>>> build a cluster with Titans myself, but I'd ruthlessly RMA them until I
>>>>> got satisfaction if they couldn't pass a test consisting of running an
>>>>> AMBER simulation for 100K iterations without either crashing or producing
>>>>> a nondeterministic result. The customer is always right.
>>>>>
>>>>>
>>>>> On Tue, May 28, 2013 at 1:20 PM, Marek Maly <marek.maly.ujep.cz>
>>>>> wrote:
>>>>>
>>>>>> I would wait for the results of my GPU0, GPU1 double tests before
>>>>>> drawing any serious conclusions.
>>>>>>
>>>>>> BTW, what exactly does "GPU is hosed" mean? Something like the GPU is
>>>>>> damaged, or so?
>>>>>>
>>>>>> Also, it would be strange (not probable) to have bought 2 GPUs that are
>>>>>> somehow damaged (even in the same way).
>>>>>>
>>>>>> As I wrote, the memtestG80 tests were negative on both cards. If,
>>>>>> moreover, both cards perfectly reproduce both repetitions of the Amber
>>>>>> benchmarks and possibly pass some other GPU tests (can you recommend any
>>>>>> besides memtestG80?), I will still believe that the GPU cards are OK
>>>>>> (also thanks to the partial successes in my Amber simulations and the
>>>>>> current Amber benchmarks). So maybe I will eventually try downclocking,
>>>>>> but there might be other variables, e.g. driver, OS, motherboard (I will
>>>>>> perhaps test one card in another MB just to be sure the problem is not
>>>>>> MB-based), etc. That's why I asked that guy "ET" earlier for info about
>>>>>> his driver version; OS or MB info would also be interesting.
>>>>>>
>>>>>> M.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, 28 May 2013 22:13:36 +0200 Scott Le Grand
>>>>>> <varelse2005.gmail.com> wrote:
>>>>>>
>>>>>> > Marek,
>>>>>> > Your GPU is hosed. I don't have anything else to add. I'm not
>>>>>> going
>>>>>> to
>>>>>> > go
>>>>>> > snark hunting for a bug that doesn't exist.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Tue, May 28, 2013 at 12:24 PM, Marek Maly <marek.maly.ujep.cz>
>>>>>> wrote:
>>>>>> >
>>>>>> >> Hi, just out of curiosity, which driver are you using on that machine
>>>>>> >> where the OC TITAN works perfectly - 319.17, or something more recent,
>>>>>> >> e.g. 319.23?
>>>>>> >>
>>>>>> >> RMA is a good idea, but it could also be a long story, and to succeed
>>>>>> >> you need strong arguments, especially if you are going to RMA two OC
>>>>>> >> TITANs.
>>>>>> >>
>>>>>> >> I am not sure whether my argument "the cards have problems with some
>>>>>> >> Amber calculations" would be strong enough here. It would be much
>>>>>> >> better to have clear results from respected GPU tests, yet as it
>>>>>> >> seems, one can run extensive GPU tests with multiple routines without
>>>>>> >> any errors and still have problems with particular Amber
>>>>>> >> simulations...
>>>>>> >>
>>>>>> >> BTW, I am now running the Amber benchmarks with nstlim=100K and
>>>>>> >> ig=default twice for each card. The tests will be done in about 3
>>>>>> >> hours (due to the slow nucleosome GB test).
>>>>>> >>
>>>>>> >> But even now I have interesting results from the first test on GPU0
>>>>>> >> (nucleosome is still running); see below.
>>>>>> >>
>>>>>> >> As you can see, JAC_NPT crashed around step 11000; here is the last
>>>>>> >> mdout record:
>>>>>> >>
>>>>>> >> *********
>>>>>> >>
>>>>>> >>
>>>>>>
>>>>>> ------------------------------**------------------------------**
>>>>>> -------------
>>>>>> -----
>>>>>> >>
>>>>>> >> check COM velocity, temp: 0.000021 0.00(Removed)
>>>>>> >>
>>>>>> >> NSTEP = 11000 TIME(PS) = 28.000 TEMP(K) = 300.39
>>>>>> PRESS
>>>>>> >> =
>>>>>> >> -9.4
>>>>>> >> Etot = -58092.8958 EKtot = 14440.2520 EPtot =
>>>>>> >> -72533.1478
>>>>>> >> BOND = 443.3912 ANGLE = 1253.5177 DIHED =
>>>>>> >> 970.1275
>>>>>> >> 1-4 NB = 567.2497 1-4 EEL = 6586.9007 VDWAALS =
>>>>>> >> 8664.9960
>>>>>> >> EELEC = -91019.3306 EHBOND = 0.0000 RESTRAINT =
>>>>>> >> 0.0000
>>>>>> >> EKCMT = 6274.0354 VIRIAL = 6321.9969 VOLUME =
>>>>>> >> 236141.9494
>>>>>> >> Density =
>>>>>> >> 1.0162
>>>>>> >>
>>>>>> >>
>>>>>>
>>>>>> ------------------------------**------------------------------**
>>>>>> -------------
>>>>>> -----
>>>>>> >>
>>>>>> >> | ERROR: max pairlist cutoff must be less than unit cell max
>>>>>> sphere
>>>>>> >> radius!
>>>>>> >>
>>>>>> >> ********
>>>>>> >>
>>>>>> >> Any idea about that ERROR ?
>>>>>> >>
>>>>>> >> On the other hand, FACTOR_IX_NPT, which has many more atoms, passed
>>>>>> >> without any issue.
>>>>>> >>
>>>>>> >> Cellulose crashed at the beginning without any ERROR message in the
>>>>>> >> mdout file.
>>>>>> >>
>>>>>> >>
>>>>>> >> I am very curious about the exact reproducibility of the results, at
>>>>>> >> least within the two tests on each individual card.
>>>>>> >>
>>>>>> >> BTW, regarding possible downclocking, does anyone have an idea about
>>>>>> >> an NVclock alternative, or will I really be forced to edit the
>>>>>> >> frequency value in the GPU BIOS?
>>>>>> >>
>>>>>> >> Best,
>>>>>> >>
>>>>>> >> Marek
>>>>>> >>
>>>>>> >> HERE ARE THE FIRST DATA FROM MY 2x2 Bench tests
>>>>>> >>
>>>>>> >> JAC_PRODUCTION_NVE - 23,558 atoms PME
>>>>>> >> -------------------------------------
>>>>>> >> 1 x GTX_TITAN: | ns/day = 115.91   seconds/ns = 745.39
>>>>>> >>
>>>>>> >> JAC_PRODUCTION_NPT - 23,558 atoms PME
>>>>>> >> -------------------------------------
>>>>>> >> 1 x GTX_TITAN: STOP PMEMD Terminated Abnormally!
>>>>>> >> | ns/day = 90.72   seconds/ns = 952.42
>>>>>> >>
>>>>>> >> FACTOR_IX_PRODUCTION_NVE - 90,906 atoms PME
>>>>>> >> -------------------------------------------
>>>>>> >> 1 x GTX_TITAN: | ns/day = 30.56   seconds/ns = 2827.33
>>>>>> >>
>>>>>> >> FACTOR_IX_PRODUCTION_NPT - 90,906 atoms PME
>>>>>> >> -------------------------------------------
>>>>>> >> 1 x GTX_TITAN: | ns/day = 25.01   seconds/ns = 3454.56
>>>>>> >>
>>>>>> >> CELLULOSE_PRODUCTION_NVE - 408,609 atoms PME
>>>>>> >> --------------------------------------------
>>>>>> >> 1 x GTX_TITAN: Error: unspecified launch failure launching kernel kNLSkinTest
>>>>>> >> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
>>>>>> >> grep: mdinfo.1GTX_TITAN: No such file or directory
>>>>>> >>
>>>>>> >> TRPCAGE_PRODUCTION - 304 atoms GB
>>>>>> >> ---------------------------------
>>>>>> >> 1 x GTX_TITAN: | ns/day = 595.09   seconds/ns = 145.19
>>>>>> >>
>>>>>> >> MYOGLOBIN_PRODUCTION - 2,492 atoms GB
>>>>>> >> -------------------------------------
>>>>>> >> 1 x GTX_TITAN: | ns/day = 202.56   seconds/ns = 426.53
>>>>>> >>
>>>>>> >> NUCLEOSOME_PRODUCTION - 25,095 atoms GB
>>>>>> >> ---------------------------------------
>>>>>> >> 1 x GTX_TITAN:
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> On Tue, 28 May 2013 20:42:32 +0200 ET <sketchfoot.gmail.com> wrote:
>>>>>> >>
>>>>>> >> > Hi,
>>>>>> >> >
>>>>>> >> > I just got a superclocked Titan and one at normal frequency. The
>>>>>> >> > first one ran like a charm with no issues so far. The other,
>>>>>> >> > standard-clocked one could never get past the constant-pressure
>>>>>> >> > stage in an NPT simulation. It kept writing NaN or ********* in the
>>>>>> >> > outfile. I swapped them about in the PCIe lanes and then ran it solo
>>>>>> >> > in each one of the lanes. Despite all this it was still failing the
>>>>>> >> > benchmark that the other one had no problems with.
>>>>>> >> >
>>>>>> >> > I couldn't find any memory errors with GPU-burn either, but as they
>>>>>> >> > cost near a grand a piece, I RMA'd it today. I recommend you do the
>>>>>> >> > same if it's not giving you any joy. Life's too short. :)
>>>>>> >> >
>>>>>> >> > br,
>>>>>> >> > g
>>>>>> >> >
>>>>>> >> >
>>>>>> >> > On 28 May 2013 16:57, Scott Le Grand <varelse2005.gmail.com>
>>>>>> wrote:
>>>>>> >> >
>>>>>> >> >> AMBER != NAMD...
>>>>>> >> >>
>>>>>> >> >> GTX 680 != GTX Titan...
>>>>>> >> >>
>>>>>> >> >> Ian's suggestion is a good one. But even then, you need to test
>>>>>> >> >> your GPUs, as the Titans are running right on the edge of
>>>>>> >> >> stability. Like I told Marek, try running 100K iterations of
>>>>>> >> >> Cellulose NVE twice with the same random seed. If you don't get
>>>>>> >> >> identical, bit-accurate output, your GPU is not working. Memtest
>>>>>> >> >> programs do not catch this because (I am guessing) they are
>>>>>> >> >> designed for a uniform memory hierarchy and only one path to read
>>>>>> >> >> and write data. I have a stock GTX Titan that cannot pass the
>>>>>> >> >> Cellulose NVE test and another one that does. I spent a couple of
>>>>>> >> >> days on the former GPU looking for the imaginary bug that went
>>>>>> >> >> away like magic the second I switched out the GPU.
>>>>>> >> >>
>>>>>> >> >> Scott
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> On Tue, May 28, 2013 at 8:11 AM, Robert Konecny <rok.ucsd.edu>
>>>>>> wrote:
>>>>>> >> >>
>>>>>> >> >> > Hi Scott,
>>>>>> >> >> >
>>>>>> >> >> > unfortunately we are seeing similar Amber instability on GTX
>>>>>> >> >> > Titans to what Marek is seeing. We have a box with four GTX
>>>>>> >> >> > Titans (not overclocked) running CentOS 6.3 with the NVidia
>>>>>> >> >> > 319.17 driver and Amber 12.2. Any Amber simulation longer than
>>>>>> >> >> > 10-15 min eventually crashes on these cards, including both JAC
>>>>>> >> >> > benchmarks (with extended run time). This is reproducible on all
>>>>>> >> >> > four cards.
>>>>>> >> >> >
>>>>>> >> >> > To eliminate the possible hardware error we ran extended GPU
>>>>>> >> >> > memory tests on all four Titans with memtestG80, cuda_memtest and
>>>>>> >> >> > also gpu_burn - all finished without errors. Since I agree that
>>>>>> >> >> > these programs may not test the GPU completely we also set up
>>>>>> >> >> > simulations with NAMD. We can run four NAMD simulations
>>>>>> >> >> > simultaneously for many days without any errors on this hardware.
>>>>>> >> >> > For reference - we also have exactly the same server with the
>>>>>> >> >> > same hardware components but with four GTX680s and this setup
>>>>>> >> >> > works just fine for Amber. So all this leads me to believe that a
>>>>>> >> >> > hardware error is not very likely.
>>>>>> >> >> >
>>>>>> >> >> > I would appreciate your comments on this, perhaps there is
>>>>>> >> >> > something else causing these errors which we are not seeing.
>>>>>> >> >> >
>>>>>> >> >> > Thanks,
>>>>>> >> >> >
>>>>>> >> >> > Robert
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> > On Mon, May 27, 2013 at 04:25:24PM -0700, Scott Le Grand
>>>>>> wrote:
>>>>>> >> >> > > I have two GTX Titans. One is defective, the other is not.
>>>>>> >> >> > > Unfortunately, they both pass all standard GPU memory tests.
>>>>>> >> >> > >
>>>>>> >> >> > > What the defective one doesn't do is generate reproducibly
>>>>>> >> >> > > bit-accurate outputs for simulations of Factor IX (90,986
>>>>>> >> >> > > atoms) or larger, of 100K or so iterations.
>>>>>> >> >> > >
>>>>>> >> >> > > Which is yet another reason why I insist on MD algorithms
>>>>>> >> >> > > (especially on GPUs) being deterministic. Besides its ability
>>>>>> >> >> > > to find software bugs, and fulfilling one of the most important
>>>>>> >> >> > > tenets of science, it's a great way to diagnose defective
>>>>>> >> >> > > hardware with very little effort.
>>>>>> >> >> > >
>>>>>> >> >> > > 928 MHz? That's 6% above the boost clock of a stock Titan.
>>>>>> >> >> > > Titan is pushing the performance envelope as is. If you're
>>>>>> >> >> > > going to pay the premium for such chips, I'd send them back
>>>>>> >> >> > > until you get one that runs correctly. I'm very curious how
>>>>>> >> >> > > fast you can push one of these things before they give out.
>>>>>> >> >> > >
>>>>>> >> >> > >
>>>>>> >> >> > >
>>>>>> >> >> > >
>>>>>> >> >> > >
>>>>>> >> >> > >
>>>>>> >> >> > >
>>>>>> >> >> > > On Mon, May 27, 2013 at 10:01 AM, Marek Maly
>>>>>> >> >> > > <marek.maly.ujep.cz> wrote:
>>>>>> >> >> > >
>>>>>> >> >> > > > Dear all,
>>>>>> >> >> > > >
>>>>>> >> >> > > > I have recently bought two "EVGA GTX TITAN Superclocked" GPUs.
>>>>>> >> >> > > >
>>>>>> >> >> > > > I did the first calculations (pmemd.cuda in Amber12) with
>>>>>> >> >> > > > systems of around 60K atoms without any problems (NPT,
>>>>>> >> >> > > > Langevin), but when I later tried bigger systems (around 100K
>>>>>> >> >> > > > atoms) I obtained the "classical" irritating error
>>>>>> >> >> > > >
>>>>>> >> >> > > > cudaMemcpy GpuBuffer::Download failed unspecified launch failure
>>>>>> >> >> > > >
>>>>>> >> >> > > > just after a few thousand MD steps.
>>>>>> >> >> > > >
>>>>>> >> >> > > > This was obviously the reason for the memtestG80 tests
>>>>>> >> >> > > > ( https://simtk.org/home/memtest ).
>>>>>> >> >> > > >
>>>>>> >> >> > > > So I compiled memtestG80 from source ( memtestG80-1.1-src.tar.gz )
>>>>>> >> >> > > > and then tested just a small part of the GPU memory (200 MB)
>>>>>> >> >> > > > using 100 iterations.
>>>>>> >> >> > > >
>>>>>> >> >> > > > On both cards I obtained a huge number of errors, but "just"
>>>>>> >> >> > > > on "Random blocks:" - 0 errors in all the remaining tests in
>>>>>> >> >> > > > all iterations.
>>>>>> >> >> > > >
>>>>>> >> >> > > > ------THE LAST ITERATION AND FINAL RESULTS-------
>>>>>> >> >> > > >
>>>>>> >> >> > > > Test iteration 100 (GPU 0, 200 MiB): 169736847 errors so far
>>>>>> >> >> > > >   Moving Inversions (ones and zeros): 0 errors (6 ms)
>>>>>> >> >> > > >   Memtest86 Walking 8-bit: 0 errors (53 ms)
>>>>>> >> >> > > >   True Walking zeros (8-bit): 0 errors (26 ms)
>>>>>> >> >> > > >   True Walking ones (8-bit): 0 errors (26 ms)
>>>>>> >> >> > > >   Moving Inversions (random): 0 errors (6 ms)
>>>>>> >> >> > > >   Memtest86 Walking zeros (32-bit): 0 errors (105 ms)
>>>>>> >> >> > > >   Memtest86 Walking ones (32-bit): 0 errors (104 ms)
>>>>>> >> >> > > >   Random blocks: 1369863 errors (27 ms)
>>>>>> >> >> > > >   Memtest86 Modulo-20: 0 errors (215 ms)
>>>>>> >> >> > > >   Logic (one iteration): 0 errors (4 ms)
>>>>>> >> >> > > >   Logic (4 iterations): 0 errors (8 ms)
>>>>>> >> >> > > >   Logic (shared memory, one iteration): 0 errors (8 ms)
>>>>>> >> >> > > >   Logic (shared-memory, 4 iterations): 0 errors (25 ms)
>>>>>> >> >> > > >
>>>>>> >> >> > > > Final error count after 100 iterations over 200 MiB of GPU
>>>>>> >> >> > > > memory: 171106710 errors
>>>>>> >> >> > > >
>>>>>> >> >> > > > ------------------------------**------------
>>>>>> >> >> > > >
>>>>>> >> >> > > > I have some questions and would be really grateful for
>>>>>> any
>>>>>> >> >> comments.
>>>>>> >> >> > > >
>>>>>> >> >> > > > Regarding overclocking: using deviceQuery I found out that
>>>>>> >> >> > > > under Linux both cards automatically run at the boost
>>>>>> >> >> > > > shader/GPU frequency, which here is 928 MHz (the base value
>>>>>> >> >> > > > for these factory OC cards is 876 MHz). deviceQuery reported
>>>>>> >> >> > > > a Memory Clock rate of 3004 MHz although "it" should be 6008
>>>>>> >> >> > > > MHz, but maybe the quantity reported by deviceQuery as
>>>>>> >> >> > > > "Memory Clock rate" is different from the product
>>>>>> >> >> > > > specification "Memory Clock". It seems that "Memory Clock
>>>>>> >> >> > > > rate" = "Memory Clock"/2. Am I right? Or is deviceQuery just
>>>>>> >> >> > > > unable to read this spec properly on the Titan GPU?
>>>>>> >> >> > > >
>>>>>> >> >> > > > Anyway, for the moment I assume that the problem might be due
>>>>>> >> >> > > > to the high shader/GPU frequency.
>>>>>> >> >> > > > (see here: http://folding.stanford.edu/English/DownloadUtils )
>>>>>> >> >> > > >
>>>>>> >> >> > > > To verify this hypothesis one should perhaps UNDERclock to
>>>>>> >> >> > > > the base frequency, which for this model is 876 MHz, or even
>>>>>> >> >> > > > to the TITAN REFERENCE frequency, which is 837 MHz.
>>>>>> >> >> > > >
>>>>>> >> >> > > > Obviously I am working with these cards under Linux (CentOS,
>>>>>> >> >> > > > 2.6.32-358.6.1.el6.x86_64) and, as I found, the OC tools
>>>>>> >> >> > > > under Linux are in fact limited to the NVclock utility, which
>>>>>> >> >> > > > is unfortunately out of date (at least as far as the GTX
>>>>>> >> >> > > > Titan is concerned). I got the following message when I just
>>>>>> >> >> > > > wanted NVclock to read and print the shader and memory
>>>>>> >> >> > > > frequencies of my Titans:
>>>>>> >> >> > > >
>>>>>> >> >> > > >
>>>>>> >> >> > > > ---------------------------------------------------------------------
>>>>>> >> >> > > >
>>>>>> >> >> > > > [root.dyn-138-272 NVCLOCK]# nvclock -s --speeds
>>>>>> >> >> > > > Card: Unknown Nvidia card
>>>>>> >> >> > > > Card number: 1
>>>>>> >> >> > > > Memory clock: -2147483.750 MHz
>>>>>> >> >> > > > GPU clock: -2147483.750 MHz
>>>>>> >> >> > > >
>>>>>> >> >> > > > Card: Unknown Nvidia card
>>>>>> >> >> > > > Card number: 2
>>>>>> >> >> > > > Memory clock: -2147483.750 MHz
>>>>>> >> >> > > > GPU clock: -2147483.750 MHz
>>>>>> >> >> > > >
>>>>>> >> >> > > >
>>>>>> >> >> > > >
>>>>>> >> >> > > > ---------------------------------------------------------------------
>>>>>> >> >> > > >
>>>>>> >> >> > > >
>>>>>> >> >> > > > I would be really grateful for some tips regarding "NVclock
>>>>>> >> >> > > > alternatives", but after wasting some hours on googling it
>>>>>> >> >> > > > seems that there is no other Linux tool with NVclock's
>>>>>> >> >> > > > functionality. So the only possibility here is perhaps to
>>>>>> >> >> > > > edit the GPU BIOS with Lin/DOS/Win tools (Kepler BIOS
>>>>>> >> >> > > > Tweaker, NVflash), but obviously I would rather avoid such an
>>>>>> >> >> > > > approach, as using it perhaps also voids the warranty, even
>>>>>> >> >> > > > if I am going to underclock the GPUs, not overclock them.
>>>>>> >> >> > > > So before this possible step (GPU BIOS editing) I would like
>>>>>> >> >> > > > to have some approximate estimate of the probability that the
>>>>>> >> >> > > > problems really are caused by the overclocking (too high a
>>>>>> >> >> > > > default (boost) shader frequency).
>>>>>> >> >> > > >
>>>>>> >> >> > > > I hope to estimate this probability from the responses of
>>>>>> >> >> > > > other Amber/Titan SC users, if I am not the only crazy guy
>>>>>> >> >> > > > who bought this model for Amber calculations :)) But of
>>>>>> >> >> > > > course any experiences with Titan cards related to their
>>>>>> >> >> > > > memtestG80 results and UNDER/OVERclocking (if possible under
>>>>>> >> >> > > > Linux) are welcome as well!
>>>>>> >> >> > > >
>>>>>> >> >> > > > My HW/SW configuration
>>>>>> >> >> > > >
>>>>>> >> >> > > > motherboard: ASUS P9X79 PRO
>>>>>> >> >> > > > CPU: Intel Core i7-3930K
>>>>>> >> >> > > > RAM: CRUCIAL Ballistix Sport 32GB (4x8GB) DDR3 1600 VLP
>>>>>> >> >> > > > CASE: CoolerMaster Dominator CM-690 II Advanced,
>>>>>> >> >> > > > Power:Enermax PLATIMAX EPM1200EWT 1200W, 80+, Platinum
>>>>>> >> >> > > > GPUs : 2 x EVGA GTX TITAN Superclocked 6GB
>>>>>> >> >> > > > cooler: Cooler Master Hyper 412 SLIM
>>>>>> >> >> > > >
>>>>>> >> >> > > > OS: CentOS (2.6.32-358.6.1.el6.x86_64)
>>>>>> >> >> > > > driver version: 319.17
>>>>>> >> >> > > > cudatoolkit_5.0.35_linux_64_rhel6.x
>>>>>> >> >> > > >
>>>>>> >> >> > > > The computer is in an air-conditioned room with a constant
>>>>>> >> >> > > > ambient temperature of around 18°C.
>>>>>> >> >> > > >
>>>>>> >> >> > > >
>>>>>> >> >> > > > Thanks a lot in advance for any comment/experience !
>>>>>> >> >> > > >
>>>>>> >> >> > > > Best wishes,
>>>>>> >> >> > > >
>>>>>> >> >> > > > Marek
>>>>>> >> >> > > >
>>>>>> >> >> > > > --
>>>>>> >> >> > > > Tato zpráva byla vytvořena převratným poštovním klientem
>>>>>> >> >> > > > Opery:
>>>>>> >> >> > > > http://www.opera.com/mail/
>>>>>> >> >> > > >
>>>>>> >> >> > > > ______________________________**_________________
>>>>>> >> >> > > > AMBER mailing list
>>>>>> >> >> > > > AMBER.ambermd.org
>>>>>> >> >> > > >
>>>>>> http://lists.ambermd.org/**mailman/listinfo/amber<http://lists.ambermd.org/mailman/listinfo/amber>
>>>>>> >> >> > > >
>>>>>> >> >> > > ______________________________**_________________
>>>>>> >> >> > > AMBER mailing list
>>>>>> >> >> > > AMBER.ambermd.org
>>>>>> >> >> > >
>>>>>> http://lists.ambermd.org/**mailman/listinfo/amber<http://lists.ambermd.org/mailman/listinfo/amber>
>>>>>> >> >> >
>>>>>> >> >> > ______________________________**_________________
>>>>>> >> >> > AMBER mailing list
>>>>>> >> >> > AMBER.ambermd.org
>>>>>> >> >> >
>>>>>> http://lists.ambermd.org/**mailman/listinfo/amber<http://lists.ambermd.org/mailman/listinfo/amber>
>>>>>> >> >> >
>>>>>> >> >> ______________________________**_________________
>>>>>> >> >> AMBER mailing list
>>>>>> >> >> AMBER.ambermd.org
>>>>>> >> >>
>>>>>> http://lists.ambermd.org/**mailman/listinfo/amber<http://lists.ambermd.org/mailman/listinfo/amber>
>>>>>> >> >>
>>>>>> >> > ______________________________**_________________
>>>>>> >> > AMBER mailing list
>>>>>> >> > AMBER.ambermd.org
>>>>>> >> >
>>>>>> http://lists.ambermd.org/**mailman/listinfo/amber<http://lists.ambermd.org/mailman/listinfo/amber>
>>>>>> >> >
>>>>>> >> > __________ Informace od ESET NOD32 Antivirus, verze databaze
>>>>>> 8385
>>>>>> >> > (20130528) __________
>>>>>> >> >
>>>>>> >> > Tuto zpravu proveril ESET NOD32 Antivirus.
>>>>>> >> >
>>>>>> >> > http://www.eset.cz
>>>>>> >> >
>>>>>> >> >
>>>>>> >> >
>>>>>> >>
>>>>>> >>
>>>>>> >> --
>>>>>> >> Tato zpráva byla vytvořena převratným poštovním klientem Opery:
>>>>>> >> http://www.opera.com/mail/
>>>>>> >>
>>>>>> >> ______________________________**_________________
>>>>>> >> AMBER mailing list
>>>>>> >> AMBER.ambermd.org
>>>>>> >>
>>>>>> http://lists.ambermd.org/**mailman/listinfo/amber<http://lists.ambermd.org/mailman/listinfo/amber>
>>>>>> >>
>>>>>> > ______________________________**_________________
>>>>>> > AMBER mailing list
>>>>>> > AMBER.ambermd.org
>>>>>> >
>>>>>> http://lists.ambermd.org/**mailman/listinfo/amber<http://lists.ambermd.org/mailman/listinfo/amber>
>>>>>> >
>>>>>> > __________ Informace od ESET NOD32 Antivirus, verze databaze 8386
>>>>>> > (20130528) __________
>>>>>> >
>>>>>> > Tuto zpravu proveril ESET NOD32 Antivirus.
>>>>>> >
>>>>>> > http://www.eset.cz
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Tato zpráva byla vytvořena převratným poštovním klientem Opery:
>>>>>> http://www.opera.com/mail/
>>>>>>
>>>>>> ______________________________**_________________
>>>>>> AMBER mailing list
>>>>>> AMBER.ambermd.org
>>>>>> http://lists.ambermd.org/**mailman/listinfo/amber<http://lists.ambermd.org/mailman/listinfo/amber>
>>>>>>
>>>>>> ______________________________**_________________
>>>>> AMBER mailing list
>>>>> AMBER.ambermd.org
>>>>> http://lists.ambermd.org/**mailman/listinfo/amber<http://lists.ambermd.org/mailman/listinfo/amber>
>>>>>
>>>>>
>>>>> ______________________________**_________________
>>>>> AMBER mailing list
>>>>> AMBER.ambermd.org
>>>>> http://lists.ambermd.org/**mailman/listinfo/amber<http://lists.ambermd.org/mailman/listinfo/amber>
>>>>>
>>>>
>>>>
>>>>
>>>> ______________________________**_________________
>>>> AMBER mailing list
>>>> AMBER.ambermd.org
>>>> http://lists.ambermd.org/**mailman/listinfo/amber<http://lists.ambermd.org/mailman/listinfo/amber>
>>>>
>>>> __________ Informace od ESET NOD32 Antivirus, verze databaze 8386
>>>> (20130528) __________
>>>>
>>>> Tuto zpravu proveril ESET NOD32 Antivirus.
>>>>
>>>> http://www.eset.cz
>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> Tato zpráva byla vytvořena převratným poštovním klientem Opery:
>>> http://www.opera.com/mail/
>>> _______________________________________________
>>> AMBER mailing list
>>> AMBER.ambermd.org
>>> http://lists.ambermd.org/mailman/listinfo/amber
>>>
>>>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>
> __________ Informace od ESET NOD32 Antivirus, verze databaze 8390
> (20130529) __________
>
> Tuto zpravu proveril ESET NOD32 Antivirus.
>
> http://www.eset.cz
>
>
>


-- 
This message was created with Opera's revolutionary e-mail client:
http://www.opera.com/mail/
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed May 29 2013 - 15:00:02 PDT