Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Marek Maly <marek.maly.ujep.cz>
Date: Tue, 04 Jun 2013 13:39:57 +0200

Hi,
  here are my results from the "NTPR" experiment:


Total energy (Etot) at step 100,000, reported for ROUND_1 and ROUND_2
(in all cases: driver 319.23, Amber12 with bugfix 18 applied, CUDA 5.0)

GTX580 (NTPR=1000)
-66801.3274
-66801.3274

TITAN_0 (NTPR=1)
-66854.0492
-66802.4419

TITAN_1 (NTPR=1)
-66858.7444
-66858.7444
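
For anyone who wants to run the same kind of comparison on their own mdout files, here is a minimal Python sketch. It is not the script used for the numbers above; the helper names etot_by_step and first_divergence are made up for illustration, and it assumes the usual mdout layout in which an "NSTEP = ..." line is followed by an "Etot = ..." line. It also assumes that the first record for a given step is the instantaneous one, so that the averages printed at the end of mdout do not overwrite the last step.

```python
import re

# "NSTEP = 100000" and "Etot = -66843.8345" (or a run of '*' when the field overflows).
NSTEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")
ETOT_RE = re.compile(r"Etot\s*=\s*(\*+|-?\d+\.\d+)")


def etot_by_step(mdout_text):
    """Map each NSTEP to its Etot; overflowed ('****') fields become None."""
    energies = {}
    step = None
    for line in mdout_text.splitlines():
        match = NSTEP_RE.search(line)
        if match:
            step = int(match.group(1))
            continue
        match = ETOT_RE.search(line)
        if match and step is not None:
            value = match.group(1)
            # Keep only the first record per step (assumption: later records
            # with the same NSTEP are the averages / RMS blocks).
            energies.setdefault(step, None if value.startswith("*") else float(value))
            step = None
    return energies


def first_divergence(run_a, run_b):
    """Return the first common step whose Etot differs, or None if all match.

    A step where either run printed '****' is counted as a divergence.
    """
    for step in sorted(set(run_a) & set(run_b)):
        a, b = run_a[step], run_b[step]
        if a is None or b is None or a != b:
            return step
    return None
```

Reading the two mdout files of ROUND_1 and ROUND_2 into strings and calling first_divergence(etot_by_step(text_1), etot_by_step(text_2)) gives the first printed step at which the runs disagree. With NTPR=1 that is the exact step; with NTPR=1000 it is only accurate to within 1000 steps.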


       M.




On Tue, 04 Jun 2013 06:14:28 +0200 Marek Maly <marek.maly.ujep.cz>
wrote:

> Hi Scott,
>
> I am sending my very first tests/table again (see attached), where
> I also ran GTX 580/GTX 680 tests as a control. As you can see
> there, I obtained perfect reproducibility on those GTX cards, but also
> on my second TITAN card (TITAN_1) for NUCLEOSOME! But that was with
> driver 319.17
> (and also before bugfix 18).
>
> Now I will try on my Titans again with ntpr=1, as you wish
> (driver 319.23, Amber12 bugfix 18 applied, cuda 5.0).
>
> Simultaneously I will repeat this test on the GTX 580 with ntpr=1000
> (driver 319.23, Amber12 bugfix 18 applied, cuda 5.0).
>
> BTW, I also experimented a bit. First I tried to take some settings from
> NUCLEOSOME (e.g. igb=5, ntt=1/3, saltcon=0.1, tautp=1.0 + restraints) and
> use them for TRP cage and myoglobin, assuming that these parameters, which
> differ between NUCLEOSOME and TRP cage + myoglobin, would affect the
> TRP cage + myoglobin reproducibility.
>
> This was not confirmed, i.e. TRP cage + myoglobin were still perfectly
> reproducible.
>
> So then (to be sure) I did the opposite experiment and used the TRP cage
> mdin file for NUCLEOSOME to see whether it influences the NUCLEOSOME
> reproducibility, but in agreement with the "TRP-MYO" tests, NUCLEOSOME
> was again irreproducible ...
>
> So let's see the ntpr tests.
>
> M.
>
>
>
>
> On Tue, 04 Jun 2013 04:51:08 +0200 Scott Le Grand
> <varelse2005.gmail.com>
> wrote:
>
>> Update: The nucleosome GB irreproducibility is weird. It goes away on my
>> Titan if I set ntpr to 1 (I was trying to find the offending energy
>> component that diverges first). Can you guys try this on your machines?
>> I think this might be SW...
>>
>>
>>
>>
>>
>>
>> On Mon, Jun 3, 2013 at 1:18 PM, ET <sketchfoot.gmail.com> wrote:
>>
>>> Hi Scott & Ross,
>>>
>>> I take it you will post to this thread once a fix has been found? :)
>>>
>>> br,
>>> g
>>>
>>>
>>> On 3 June 2013 20:31, Marek Maly <marek.maly.ujep.cz> wrote:
>>>
>>> > OK,
>>> > I just took a deep breath and started to pray :))
>>> >
>>> > BTW, the difference between the GB results for TRPcage/myoglobin
>>> > (perfectly reproducible) versus Nucleosome (irreproducible results)
>>> > might be connected with some differences in the mdin parameters:
>>> >
>>> > TRPcage/myoglobin (igb=1, ntt=3) versus Nucleosome (igb=5, ntt=1).
>>> > The Nucleosome simulation also uses a restraint:
>>> >
>>> > RESTRAIN DNA
>>> > 0.1
>>> > RES 1 294
>>> > END
>>> > END
>>> >
>>> > I will try to experiment here to learn which parameter is responsible
>>> > for the irreproducible Nucleosome results.
>>> >
>>> > M.
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Mon, 03 Jun 2013 21:17:23 +0200 Ross Walker <ross.rosswalker.co.uk>
>>> > wrote:
>>> >
>>> > > Hi Marek,
>>> > >
>>> > > To be honest I would just take a deep breath and give us some time to
>>> > > figure out what is going on with the Titan and work around it. Hopefully
>>> > > this won't take too long and we can have a patch out shortly.
>>> > >
>>> > > All the best
>>> > > Ross
>>> > >
>>> > >
>>> > >
>>> > > On 6/3/13 11:47 AM, "Marek Maly" <marek.maly.ujep.cz> wrote:
>>> > >
>>> > >> Thanks Scott!
>>> > >>
>>> > >> This sounds to me like "Of course you can win the gold treasure if you
>>> > >> survive Russian roulette first ..."
>>> > >>
>>> > >> It seems that the difference in reliability for scientific calculations
>>> > >> between Teslas and "equivalent" stock GTXs is now (with the GK110 chip)
>>> > >> clearly bigger. I am curious how the GTX 780 will compare to the Titans.
>>> > >>
>>> > >> So let's hope that in the worst case downclocking the Titans might solve
>>> > >> the problem.
>>> > >>
>>> > >> BTW, what is the working temperature of your K20c? My Titans run at under
>>> > >> 80°C (approx. 60% fan utilization). For the older cards (GTX 680/580 ...)
>>> > >> this temperature should be OK, but maybe for the GK110 it is already too
>>> > >> high to ensure zero "bit fluctuations".
>>> > >>
>>> > >> cuFFT may be responsible for the crashes and maybe also for some of the
>>> > >> irreproducibility, but the irreproducibility of the results will also
>>> > >> have some other source, as the NUCLEOSOME GB test suggests, where perhaps
>>> > >> no FFT is involved (just the real space calculation)?
>>> > >>
>>> > >> So thanks for the moment, and please let us know when you make some
>>> > >> progress.
>>> > >>
>>> > >>
>>> > >> M.
>>> > >>
>>> > >>
>>> > >>
>>> > >> On Mon, 03 Jun 2013 20:12:04 +0200 Scott Le Grand
>>> > >> <varelse2005.gmail.com>
>>> > >> wrote:
>>> > >>
>>> > >>> Addressing Divi's two points:
>>> > >>>
>>> > >>> 1. We're trying to find a way to do this...
>>> > >>>
>>> > >>> 2. I am extremely paranoid and while I would still use the Titans for
>>> > >>> development and testing, I would also currently do my publishable runs on
>>> > >>> GK104 GPUs or K20s. Given that, if you're comfortable with
>>> > >>> nondeterministic execution a la GROMACS, ACEMD, and NAMD, what's going on
>>> > >>> here is seemingly no worse. I'm *not* comfortable with that myself and I
>>> > >>> intend to find a fix or workaround like we did a couple years ago with
>>> > >>> GTX4xx and GTX5xx. So your best strategy might just be to wait a week or
>>> > >>> two and see what comes of the bug hunt.
>>> > >>>
>>> > >>> Marek et al., if these GPU tests are failing on the Titans, then by all
>>> > >>> means return them without hesitation, but I don't think consumer-level
>>> > >>> GPUs are tested with the same level of rigor as Teslas. The upside is you
>>> > >>> get 30% better performance for 1/3 the price. The downside is that IMO
>>> > >>> you should carefully validate them before using them. What I'm seeing
>>> > >>> here looks like single bit differences at the low-order bits that cause a
>>> > >>> tiny fluctuation that ultimately mushrooms and diverges the whole shebang,
>>> > >>> along with occasional crashes. The crashes seem to occur in cuFFT
>>> > >>> somewhere. I have yet to see divergence there.
>>> > >>>
>>> > >>> Scott
>>> > >>>
>>> > >>>
>>> > >>> On Mon, Jun 3, 2013 at 9:42 AM, Marek Maly <marek.maly.ujep.cz> wrote:
>>> > >>>
>>> > >>>> Hi,
>>> > >>>> so here are my NUCLEOSOME test results. All tests finished, although
>>> > >>>> TITAN_0/ROUND_2 ended with "****" energies (the "***" records start from
>>> > >>>> the 75K step, so it is a surprise to me that the test ran to the end).
>>> > >>>> All the results are irreproducible (driver 319.23, Amber12 bugfix 18
>>> > >>>> applied, cuda 5.5). I will repeat it with CUDA 5.0.
>>> > >>>>
>>> > >>>> M.
>>> > >>>>
>>> > >>>> >>>>>> TITAN_0
>>> > >>>>
>>> > >>>>
>>> > >>>> ROUND_1
>>> > >>>>
>>> > >>>> ------------------------------------------------------------------------------
>>> > >>>>
>>> > >>>> NSTEP = 100000  TIME(PS) = 300.000  TEMP(K) = 310.60  PRESS = 0.0
>>> > >>>> Etot   =    -66843.8345  EKtot   =     19690.5156  EPtot     =    -86534.3502
>>> > >>>> BOND   =      5887.3611  ANGLE   =     13673.5215  DIHED     =     16941.7678
>>> > >>>> 1-4 NB =      5576.6911  1-4 EEL =      1371.5924  VDWAALS   =    -13647.8461
>>> > >>>> EELEC  =    -14410.1252  EGB     =   -102286.9459  RESTRAINT =       359.6331
>>> > >>>> EAMBER (non-restraint) =    -86893.9832
>>> > >>>>
>>> > >>>> ------------------------------------------------------------------------------
>>> > >>>>
>>> > >>>> ROUND_2
>>> > >>>>
>>> > >>>> ------------------------------------------------------------------------------
>>> > >>>>
>>> > >>>> NSTEP = 100000  TIME(PS) = 300.000  TEMP(K) =*********  PRESS = 0.0
>>> > >>>> Etot   = **************  EKtot   = **************  EPtot     =   4279668.7807
>>> > >>>> BOND   =        -0.0000  ANGLE   =   4681740.3488  DIHED     =     67661.6797
>>> > >>>> 1-4 NB =        -0.0000  1-4 EEL =        -2.0373  VDWAALS   =       244.1012
>>> > >>>> EELEC  =     72548.4049  EGB     =   -542523.7166  RESTRAINT =        -0.0000
>>> > >>>> EAMBER (non-restraint) =   4279668.7807
>>> > >>>>
>>> > >>>> ------------------------------------------------------------------------------
>>> > >>>> STARS from the 75k step ...
>>> > >>>>
>>> > >>>>
>>> > >>>> >>>>>> TITAN_1
>>> > >>>>
>>> > >>>>
>>> > >>>> ROUND_1
>>> > >>>>
>>> > >>>> ------------------------------------------------------------------------------
>>> > >>>>
>>> > >>>> NSTEP = 100000  TIME(PS) = 300.000  TEMP(K) = 310.36  PRESS = 0.0
>>> > >>>> Etot   =    -66846.8801  EKtot   =     19675.0488  EPtot     =    -86521.9289
>>> > >>>> BOND   =      5760.2422  ANGLE   =     13619.8710  DIHED     =     16996.9045
>>> > >>>> 1-4 NB =      5645.6416  1-4 EEL =      1774.6967  VDWAALS   =    -13622.9343
>>> > >>>> EELEC  =    -14168.1788  EGB     =   -102880.8089  RESTRAINT =       352.6371
>>> > >>>> EAMBER (non-restraint) =    -86874.5660
>>> > >>>>
>>> > >>>> ------------------------------------------------------------------------------
>>> > >>>>
>>> > >>>> ROUND_2
>>> > >>>>
>>> > >>>> ------------------------------------------------------------------------------
>>> > >>>>
>>> > >>>> NSTEP = 100000  TIME(PS) = 300.000  TEMP(K) = 311.00  PRESS = 0.0
>>> > >>>> Etot   =    -66874.9016  EKtot   =     19715.3633  EPtot     =    -86590.2649
>>> > >>>> BOND   =      5819.0667  ANGLE   =     13683.6633  DIHED     =     16918.8596
>>> > >>>> 1-4 NB =      5627.0932  1-4 EEL =      1576.9564  VDWAALS   =    -13747.1032
>>> > >>>> EELEC  =    -15232.3280  EGB     =   -101590.5078  RESTRAINT =       354.0348
>>> > >>>> EAMBER (non-restraint) =    -86944.2997
>>> > >>>>
>>> > >>>> ------------------------------------------------------------------------------
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>> On Mon, 03 Jun 2013 12:34:15 +0200 Marek Maly <marek.maly.ujep.cz>
>>> > >>>> wrote:
>>> > >>>>
>>> > >>>> > OK, I will try the NUCLEOSOME case as well with my latest
>>> > >>>> > settings: (driver 319.23, Amber12 bugfix 18 applied, cuda 5.5)
>>> > >>>> >
>>> > >>>> > M.
>>> > >>>> >
>>> > >>>> >
>>> > >>>> >
>>> > >>>> >
>>> > >>>> > On Mon, 03 Jun 2013 11:51:46 +0200 ET <sketchfoot.gmail.com> wrote:
>>> > >>>> >
>>> > >>>> >> Hi all,
>>> > >>>> >>
>>> > >>>> >> I reran the benchmark with Amber recompiled, the latest drivers, and the
>>> > >>>> >> GPU in solo configuration. This yields the following results:
>>> > >>>> >>
>>> > >>>> >>
>>> > >>>> >> When I run the tests on GPU-00_TeaNCake:
>>> > >>>> >>
>>> > >>>> >> 1) All the tests (across 2x repeats) finish successfully.
>>> > >>>> >>
>>> > >>>> >>
>>> > >>>> >> 2) The sdiff logs indicate that reproducibility across the two repeats is
>>> > >>>> >> as follows:
>>> > >>>> >>
>>> > >>>> >> GB_myoglobin: Reproducible across 1,000,000 steps
>>> > >>>> >> GB_nucleosome: No reproducibility shown from step 3,400 onwards. Also
>>> > >>>> >> outfile is not written properly - blank gaps appear where something should
>>> > >>>> >> have been written.
>>> > >>>> >> GB_TRPCage: Reproducible across 1,000,000 steps
>>> > >>>> >>
>>> > >>>> >> PME_JAC_production_NVE: No reproducibility shown from step 35,000 onwards.
>>> > >>>> >> Also outfile is not written properly - blank gaps appear where something
>>> > >>>> >> should have been written.
>>> > >>>> >> PME_JAC_production_NPT: No reproducibility shown from step 69,000 onwards.
>>> > >>>> >> Also outfile is not written properly - blank gaps appear where something
>>> > >>>> >> should have been written.
>>> > >>>> >> PME_FactorIX_production_NVE: Reproducible across 100k steps
>>> > >>>> >> PME_FactorIX_production_NPT: Reproducible across 100k steps
>>> > >>>> >> PME_Cellulose_production_NVE: Reproducible across 100k steps
>>> > >>>> >> PME_Cellulose_production_NPT: No reproducibility shown from step 17,000
>>> > >>>> >> onwards. Also outfile is not written properly - blank gaps appear where
>>> > >>>> >> something should have been written.
>>> > >>>> >>
>>> > >>>> >> #################################################
>>> > >>>> >>
>>> > >>>> >>
>>> > >>>> >> So it looks like the problem does occur in GB runs too. Though I notice
>>> > >>>> >> that running in single GPU mode seems to make the problem appear much
>>> > >>>> >> later than it occurs with dual GPUs, though obviously this is quite
>>> > >>>> >> qualitative and based on only 1 repeat.
>>> > >>>> >>
>>> > >>>> >> br,
>>> > >>>> >> g
>>> > >>>> >>
>>> > >>>> >>
>>> > >>>> >>
>>> > >>>> >>
>>> > >>>> >> On 3 June 2013 10:28, ET <sketchfoot.gmail.com> wrote:
>>> > >>>> >>
>>> > >>>> >>> Hi Marek,
>>> > >>>> >>>
>>> > >>>> >>> I think what you say about Valley and Heaven is true to a certain
>>> > >>>> >>> extent, but I think the links I posted to the EVGA overclock utility &
>>> > >>>> >>> MSI Kombustor are very good ways of testing the card. I don't know the
>>> > >>>> >>> details of memtestG80 and cuda_memtest, but it seems to me that they are
>>> > >>>> >>> testing one very specific component, i.e. the memory. As the graphics
>>> > >>>> >>> card consists of more than this, it is better to have a test that checks
>>> > >>>> >>> the card in a more holistic manner IMO. :)
>>> > >>>> >>>
>>> > >>>> >>> I think this argument is supported by the fact that tech support at the
>>> > >>>> >>> store used a program called FurMark to stress test the GPU. As the GPU I
>>> > >>>> >>> returned kept failing the benchmark, they realized in less than half a
>>> > >>>> >>> day it was faulty, whilst I wasted a couple of days mucking about with
>>> > >>>> >>> GPU memory tests using Gpuburn on Linux.
>>> > >>>> >>>
>>> > >>>> >>> http://www.ozone3d.net/benchmarks/fur/
>>> > >>>> >>>
>>> > >>>> >>> I think if you are going to test on Windows, you are better off getting
>>> > >>>> >>> MSI Kombustor, which I posted earlier. It contains the test contained in
>>> > >>>> >>> FurMark and many additional tests that test the compute capability of
>>> > >>>> >>> the card.
>>> > >>>> >>>
>>> > >>>> >>> best regards,
>>> > >>>> >>> g
>>> > >>>> >>>


-- 
This message was created with Opera's revolutionary e-mail client:
http://www.opera.com/mail/
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Jun 04 2013 - 05:00:03 PDT