Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: ET <sketchfoot.gmail.com>
Date: Mon, 3 Jun 2013 21:18:20 +0100

Hi Scott & Ross,

I take it you will post to this thread once a fix has been found? :)

br,
g


On 3 June 2013 20:31, Marek Maly <marek.maly.ujep.cz> wrote:

> OK,
> I just took a deep breath and started to pray :))
>
> BTW, the difference between the GB results for TRPcage/myoglobin (perfectly
> reproducible) and Nucleosome (irreproducible) might be connected with
> differences in the mdin parameters:
>
> TRPcage/myoglobin (igb=1, ntt=3) versus Nucleosome (igb=5, ntt=1).
> The Nucleosome simulation also uses a restraint:
>
> RESTRAIN DNA
> 0.1
> RES 1 294
> END
> END
>
> I will experiment here to find out which parameter is responsible for the
> irreproducible Nucleosome results.
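
One way to organise that experiment is to start from the irreproducible Nucleosome settings and revert a single setting at a time to the value used in the reproducible runs. The following is only a minimal sketch of that bookkeeping in Python: `igb` and `ntt` are the mdin parameters named above, while the `restraint` flag and the helper name `single_change_variants` are hypothetical labels introduced here for illustration, not Amber keywords. It does not account for the systems themselves being different.

```python
# Settings taken from the message above; "restraint" is a hypothetical
# on/off label for the RESTRAIN DNA block, not an mdin keyword.
reproducible = {"igb": 1, "ntt": 3, "restraint": False}    # TRPcage/myoglobin
irreproducible = {"igb": 5, "ntt": 1, "restraint": True}   # Nucleosome


def single_change_variants(base, target):
    """Yield (key, settings) pairs in which exactly one setting of `base`
    has been replaced by the corresponding value from `target`."""
    for key in base:
        if base[key] != target[key]:
            variant = dict(base)
            variant[key] = target[key]
            yield key, variant


# Each variant is one candidate Nucleosome rerun with a single setting reverted.
variants = list(single_change_variants(irreproducible, reproducible))
for key, settings in variants:
    print(f"revert {key}: {settings}")
```

If one of these reruns becomes reproducible, the reverted setting is the first suspect; if none does, the cause probably lies elsewhere.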
>
> M.
>
>
>
>
>
> On Mon, 03 Jun 2013 21:17:23 +0200 Ross Walker <ross.rosswalker.co.uk>
> wrote:
>
> > Hi Marek,
> >
> > To be honest I would just take a deep breath and give us some time to
> > figure out what is going on with the Titan and work around it. Hopefully
> > this won't take too long and we can have a patch out shortly.
> >
> > All the best
> > Ross
> >
> >
> >
> > On 6/3/13 11:47 AM, "Marek Maly" <marek.maly.ujep.cz> wrote:
> >
> >> Thanks Scott !
> >>
> >> That sounds to me like "Of course you can win the gold treasure, if you
> >> survive Russian roulette first ..."
> >>
> >> It seems that the difference in reliability for scientific calculations
> >> between Teslas and "equivalent" stock GTXs is now (with the GK110 chip)
> >> clearly bigger. I am curious how the GTX 780 will compare to the Titans.
> >>
> >> So let's hope that, in the worst case, downclocking the Titans might solve
> >> the problem.
> >>
> >> BTW what is the working temperature of your K20c? My Titans run under
> >> 80°C (ca. 60% fan utilization). For the older cards (GTX 680/580 ...) this
> >> temperature should be OK, but maybe for the GK110 it is already too high to
> >> ensure zero "bit fluctuations".
> >>
> >> cuFFT may be responsible for the crashes and perhaps for some of the
> >> irreproducibility, but the irreproducibility must also have another
> >> source, as the NUCLEOSOME GB test suggests, where perhaps no FFT is
> >> involved (just the real-space calculation)?
> >>
> >> So thanks for the moment, and please let us know when you make some
> >> progress.
> >>
> >>
> >> M.
> >>
> >>
> >>
> >> On Mon, 03 Jun 2013 20:12:04 +0200 Scott Le Grand
> >> <varelse2005.gmail.com> wrote:
> >>
> >>> Addressing Divi's two points:
> >>>
> >>> 1. We're trying to find a way to do this...
> >>>
> >>> 2. I am extremely paranoid and while I would still use the Titans for
> >>> development and testing, I would also currently do my publishable runs on
> >>> GK104 GPUs or K20s. Given that, if you're comfortable with
> >>> nondeterministic execution ala GROMACS, ACEMD, and NAMD, what's going on
> >>> here is seemingly no worse. I'm *not* comfortable with that myself and I
> >>> intend to find a fix or workaround like we did a couple years ago with
> >>> GTX4xx and GTX5xx. So your best strategy might just be to wait a week or
> >>> two and see what comes of the bug hunt.
> >>>
> >>> Marek et al., if these GPU tests are failing on the Titans, then by all
> >>> means return them without hesitation, but I don't think consumer-level GPUs
> >>> are tested with the same level of rigor as Teslas. The upside is you get
> >>> 30% better performance for 1/3 the price. The downside is that IMO you
> >>> should carefully validate them before using them. What I'm seeing here
> >>> looks like single-bit differences in the low-order bits that cause a tiny
> >>> fluctuation that ultimately mushrooms and diverges the whole shebang, along
> >>> with occasional crashes. The crashes seem to occur in cuFFT somewhere. I
> >>> have yet to see divergence there.
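
To illustrate how a single flipped low-order bit can mushroom into a completely different trajectory, here is a minimal sketch in Python. It involves neither Amber nor a GPU: the logistic map, the parameter r = 3.9, the starting value 0.9, and the tolerance are all illustrative assumptions standing in for a chaotic MD trajectory, and the helper names are made up for this example.

```python
import struct


def flip_lowest_bit(x):
    """Return x with the least significant bit of its float64 encoding flipped."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ 1))
    return flipped


def logistic_trajectory(x0, steps, r=3.9):
    """Iterate the chaotic logistic map x -> r*x*(1-x), returning every state."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def first_divergence(a, b, tol=1e-3):
    """Index of the first state where a and b differ by more than tol, else None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if abs(x - y) > tol:
            return i
    return None


x0 = 0.9
ref = logistic_trajectory(x0, 200)
rerun = logistic_trajectory(x0, 200)                       # identical input
perturbed = logistic_trajectory(flip_lowest_bit(x0), 200)  # one bit differs

print("identical rerun diverges at:", first_divergence(ref, rerun))
print("one-bit perturbation diverges at:", first_divergence(ref, perturbed))
```

A deterministic rerun never diverges, whereas the one-bit perturbation eventually produces an unrelated trajectory. That is why bitwise-identical repeat runs are a useful hardware check: any difference at all, however small at first, shows up later as a gross mismatch.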
> >>>
> >>> Scott
> >>>
> >>>
> >>> On Mon, Jun 3, 2013 at 9:42 AM, Marek Maly <marek.maly.ujep.cz> wrote:
> >>>
> >>>> Hi,
> >>>> so here are my NUCLEOSOME test results. All tests finished, although
> >>>> TITAN_0/ROUND_2 ended with "****" energies (the *** records start at the
> >>>> 75K step, so I am surprised that the test ran to the end). All the results
> >>>> are irreproducible (driver 319.23, Amber12 bugfix 18 applied, CUDA 5.5).
> >>>> I will repeat it with CUDA 5.0.
> >>>>
> >>>> M.
> >>>>
> >>>> >>>>>> TITAN_0
> >>>>
> >>>>
> >>>> ROUND_1
> >>>>
> >>>>
> >>>>
> >>>> ------------------------------------------------------------------------------
> >>>>
> >>>> NSTEP = 100000  TIME(PS) = 300.000  TEMP(K) = 310.60  PRESS = 0.0
> >>>> Etot = -66843.8345  EKtot = 19690.5156  EPtot = -86534.3502
> >>>> BOND = 5887.3611  ANGLE = 13673.5215  DIHED = 16941.7678
> >>>> 1-4 NB = 5576.6911  1-4 EEL = 1371.5924  VDWAALS = -13647.8461
> >>>> EELEC = -14410.1252  EGB = -102286.9459  RESTRAINT = 359.6331
> >>>> EAMBER (non-restraint) = -86893.9832
> >>>>
> >>>> ------------------------------------------------------------------------------
> >>>>
> >>>> ROUND_2
> >>>>
> >>>>
> >>>>
> >>>> ------------------------------------------------------------------------------
> >>>>
> >>>> NSTEP = 100000  TIME(PS) = 300.000  TEMP(K) =*********  PRESS = 0.0
> >>>> Etot = **************  EKtot = **************  EPtot = 4279668.7807
> >>>> BOND = -0.0000  ANGLE = 4681740.3488  DIHED = 67661.6797
> >>>> 1-4 NB = -0.0000  1-4 EEL = -2.0373  VDWAALS = 244.1012
> >>>> EELEC = 72548.4049  EGB = -542523.7166  RESTRAINT = -0.0000
> >>>> EAMBER (non-restraint) = 4279668.7807
> >>>>
> >>>> ------------------------------------------------------------------------------
> >>>> STARS from the 75k step ...
> >>>>
> >>>>
> >>>> >>>>>> TITAN_1
> >>>>
> >>>>
> >>>> ROUND_1
> >>>>
> >>>>
> >>>>
> >>>> ------------------------------------------------------------------------------
> >>>>
> >>>> NSTEP = 100000  TIME(PS) = 300.000  TEMP(K) = 310.36  PRESS = 0.0
> >>>> Etot = -66846.8801  EKtot = 19675.0488  EPtot = -86521.9289
> >>>> BOND = 5760.2422  ANGLE = 13619.8710  DIHED = 16996.9045
> >>>> 1-4 NB = 5645.6416  1-4 EEL = 1774.6967  VDWAALS = -13622.9343
> >>>> EELEC = -14168.1788  EGB = -102880.8089  RESTRAINT = 352.6371
> >>>> EAMBER (non-restraint) = -86874.5660
> >>>>
> >>>> ------------------------------------------------------------------------------
> >>>>
> >>>> ROUND_2
> >>>>
> >>>>
> >>>>
> >>>> ------------------------------------------------------------------------------
> >>>>
> >>>> NSTEP = 100000  TIME(PS) = 300.000  TEMP(K) = 311.00  PRESS = 0.0
> >>>> Etot = -66874.9016  EKtot = 19715.3633  EPtot = -86590.2649
> >>>> BOND = 5819.0667  ANGLE = 13683.6633  DIHED = 16918.8596
> >>>> 1-4 NB = 5627.0932  1-4 EEL = 1576.9564  VDWAALS = -13747.1032
> >>>> EELEC = -15232.3280  EGB = -101590.5078  RESTRAINT = 354.0348
> >>>> EAMBER (non-restraint) = -86944.2997
> >>>>
> >>>> ------------------------------------------------------------------------------
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Mon, 03 Jun 2013 12:34:15 +0200 Marek Maly <marek.maly.ujep.cz>
> >>>> wrote:
> >>>>
> >>>> > OK, I will try the NUCLEOSOME case as well with my latest
> >>>> > settings (driver 319.23, Amber12 bugfix 18 applied, CUDA 5.5).
> >>>> >
> >>>> > M.
> >>>> >
> >>>> >
> >>>> >
> >>>> >
> >>>> > On Mon, 03 Jun 2013 11:51:46 +0200 ET <sketchfoot.gmail.com> wrote:
> >>>> >
> >>>> >> Hi all,
> >>>> >>
> >>>> >> I reran the benchmark with Amber recompiled and the latest drivers,
> >>>> >> with the GPU in solo configuration. The results are as follows:
> >>>> >>
> >>>> >>
> >>>> >> When I run the tests on GPU-00_TeaNCake:
> >>>> >>
> >>>> >> 1) All the tests (across 2x repeats) finish successfully:
> >>>> >>
> >>>> >>
> >>>> >> 2) The sdiff logs indicate that reproducibility across the two repeats
> >>>> >> is as follows:
> >>>> >>
> >>>> >> GB_myoglobin: Reproducible across 1,000,000 steps
> >>>> >> GB_nucleosome: No reproducibility shown from step 3,400 onwards. Also
> >>>> >> outfile is not written properly - blank gaps appear where something
> >>>> >> should have been written.
> >>>> >> GB_TRPCage: Reproducible across 1,000,000 steps
> >>>> >>
> >>>> >> PME_JAC_production_NVE: No reproducibility shown from step 35,000
> >>>> >> onwards. Also outfile is not written properly - blank gaps appear where
> >>>> >> something should have been written.
> >>>> >> PME_JAC_production_NPT: No reproducibility shown from step 69,000
> >>>> >> onwards. Also outfile is not written properly - blank gaps appear where
> >>>> >> something should have been written.
> >>>> >> PME_FactorIX_production_NVE: Reproducible across 100k steps
> >>>> >> PME_FactorIX_production_NPT: Reproducible across 100k steps
> >>>> >> PME_Cellulose_production_NVE: Reproducible across 100k steps
> >>>> >> PME_Cellulose_production_NPT: No reproducibility shown from step 17,000
> >>>> >> onwards. Also outfile is not written properly - blank gaps appear where
> >>>> >> something should have been written.
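
The "no reproducibility from step N onwards" figures above can also be extracted with a short script instead of reading sdiff output by eye. The following is a minimal sketch in Python, assuming the energy records look like the `NSTEP = ...` / `Etot = ...` lines quoted in this thread; the function names are made up for this example, and the two sample records are the TITAN_1 ROUND_1 and ROUND_2 values from Marek's message. Fields are compared as raw text, so overflowed `****` fields count as a mismatch too.

```python
import re

STEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")
ETOT_RE = re.compile(r"Etot\s*=\s*(\S+)")


def energies_by_step(text):
    """Map each NSTEP value to the raw Etot field of the record that follows it."""
    result = {}
    step = None
    for line in text.splitlines():
        match = STEP_RE.search(line)
        if match:
            step = int(match.group(1))
            continue
        match = ETOT_RE.search(line)
        if match and step is not None:
            result[step] = match.group(1)
            step = None
    return result


def first_differing_step(text_a, text_b):
    """Smallest step present in both outputs whose Etot fields differ, else None."""
    a, b = energies_by_step(text_a), energies_by_step(text_b)
    for step in sorted(a.keys() & b.keys()):
        if a[step] != b[step]:
            return step
    return None


# Final records of TITAN_1 ROUND_1 and ROUND_2 as quoted in this thread.
round_1 = (
    "NSTEP = 100000 TIME(PS) = 300.000 TEMP(K) = 310.36 PRESS = 0.0\n"
    "Etot = -66846.8801 EKtot = 19675.0488 EPtot = -86521.9289\n"
)
round_2 = (
    "NSTEP = 100000 TIME(PS) = 300.000 TEMP(K) = 311.00 PRESS = 0.0\n"
    "Etot = -66874.9016 EKtot = 19715.3633 EPtot = -86590.2649\n"
)

print(first_differing_step(round_1, round_2))
print(first_differing_step(round_1, round_1))
```

With full output files, pass their contents via `open(path).read()`. Note that a real mdout also repeats NSTEP records in its averages section at the end, which this sketch does not filter out.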
> >>>> >>
> >>>> >> #################################################
> >>>> >>
> >>>> >>
> >>>> >> So it looks like the problem does occur in GB runs too. I notice, though,
> >>>> >> that running in single-GPU mode seems to make the problem appear much
> >>>> >> later than it does with dual GPUs, though obviously this is quite
> >>>> >> qualitative and based on only 1 repeat.
> >>>> >>
> >>>> >> br,
> >>>> >> g
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >> On 3 June 2013 10:28, ET <sketchfoot.gmail.com> wrote:
> >>>> >>
> >>>> >>> Hi Marek,
> >>>> >>>
> >>>> >>> I think what you say about Valley and Heaven is true to a certain
> >>>> >>> extent, but I think the links I posted to the EVGA overclock utility and
> >>>> >>> MSI Kombustor are very good ways of testing the card. I don't know the
> >>>> >>> details of memtestG80 and cuda_memtest, but it seems to me that they
> >>>> >>> test one very specific component, i.e. the memory. As the graphics card
> >>>> >>> consists of more than this, it is better to have a test that checks the
> >>>> >>> card in a more holistic manner IMO. :)
> >>>> >>>
> >>>> >>> I think this argument is supported by the fact that tech support at the
> >>>> >>> store used a program called FurMark to stress test the GPU. As the GPU I
> >>>> >>> returned kept failing the benchmark, they realized in less than half a
> >>>> >>> day that it was faulty, whilst I wasted a couple of days mucking about
> >>>> >>> with GPU memory tests using Gpuburn on Linux.
> >>>> >>>
> >>>> >>> http://www.ozone3d.net/benchmarks/fur/
> >>>> >>>
> >>>> >>> I think if you are going to test on Windows, you are better off getting
> >>>> >>> MSI Kombustor, which I posted earlier. It contains the test from FurMark
> >>>> >>> and many additional tests that exercise the compute capability of the
> >>>> >>> card.
> >>>> >>>
> >>>> >>> best regards,
> >>>> >>> g
> >>>> >>>
> >>>> >> _______________________________________________
> >>>> >> AMBER mailing list
> >>>> >> AMBER.ambermd.org
> >>>> >> http://lists.ambermd.org/mailman/listinfo/amber
> >>>> >>
> >>>> >> __________ Information from ESET NOD32 Antivirus, database version 8405
> >>>> >> (20130603) __________
> >>>> >>
> >>>> >> This message was checked by ESET NOD32 Antivirus.
> >>>> >>
> >>>> >> http://www.eset.cz
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >
> >>>> >
> >>>>
> >>>>
> >>>> --
> >>>> This message was created with Opera's revolutionary e-mail client:
> >>>> http://www.opera.com/mail/
> >>>>
> >>>
> >>>
> >>>
> >>
> >>
> >
> >
> >
> >
> >
> >
>
>
>
Received on Mon Jun 03 2013 - 13:30:02 PDT