Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Jonathan Gough <jonathan.d.gough.gmail.com>
Date: Tue, 4 Jun 2013 07:54:09 -0400

Perhaps this is the LeGrand uncertainty principle in action. The Newtonian
wave function has collapsed?


On Mon, Jun 3, 2013 at 10:51 PM, Scott Le Grand <varelse2005.gmail.com> wrote:

> Update: The nucleosome GB irreproducibility is weird. It goes away on my
> Titan if I set ntpr to 1 (I was trying to find the offending energy
> component that diverges first). Can you guys try this on your machines?
> I think this might be a software issue...
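
One way to hunt for the energy component that diverges first is to compare the energy records of two repeat runs step by step. The Python sketch below is a hypothetical helper, not part of AMBER: the names `parse_records` and `first_divergence` are invented here, and the parser assumes the `LABEL = value` layout of the mdout energy records quoted further down this thread. It compares the printed strings, so it only sees differences at the printed precision.

```python
import re

# Matches "LABEL = value" pairs such as "1-4 NB = 5576.6911" or
# "TEMP(K) =*********" (a field that overflowed its print width).
PAIR = re.compile(r"([A-Za-z0-9][A-Za-z0-9\-\(\) ]*?)\s*=\s*(-?\d+\.?\d*|\*+)")


def parse_records(text):
    """Return {nstep: {label: value_string}} for the energy records in text.

    Pairs that appear before the first NSTEP line are ignored. A repeated
    NSTEP overwrites the values stored earlier for that step.
    """
    records, current = {}, None
    for line in text.splitlines():
        for key, value in PAIR.findall(line):
            key = key.strip()
            if key == "NSTEP":
                current = records.setdefault(int(value), {})
            elif current is not None:
                current[key] = value
    return records


def first_divergence(run_a, run_b):
    """Return (nstep, sorted labels that differ) for the earliest step at
    which the two runs disagree, or None if all shared steps match."""
    for step in sorted(set(run_a) & set(run_b)):
        diff = sorted(k for k in run_a[step] if run_a[step][k] != run_b[step].get(k))
        if diff:
            return step, diff
    return None
```

With ntpr set to 1 every step is printed, so the step returned by `first_divergence` would be the exact step at which the printed energies first disagree, and the returned labels would name the terms involved.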
>
>
>
>
>
>
> On Mon, Jun 3, 2013 at 1:18 PM, ET <sketchfoot.gmail.com> wrote:
>
> > Hi Scott & Ross,
> >
> > I take it you will post to this thread once a fix has been found? :)
> >
> > br,
> > g
> >
> >
> > On 3 June 2013 20:31, Marek Maly <marek.maly.ujep.cz> wrote:
> >
> > > OK,
> > > I just took a deep breath and started to pray :))
> > >
> > > BTW, the difference between the GB results for TRPcage/myoglobin
> > > (perfectly reproducible) and Nucleosome (irreproducible) might be
> > > connected with some differences in the mdin parameters:
> > >
> > > TRPcage/myoglobin (igb=1, ntt=3) versus Nucleosome (igb=5, ntt=1).
> > > The Nucleosome simulation also uses a restraint:
> > >
> > > RESTRAIN DNA
> > > 0.1
> > > RES 1 294
> > > END
> > > END
> > >
> > > I will try to experiment here to learn which parameter is responsible
> > > for the irreproducible Nucleosome results.
> > >
> > > M.
> > >
> > >
> > >
> > >
> > >
> > > On Mon, 03 Jun 2013 21:17:23 +0200, Ross Walker <ross.rosswalker.co.uk>
> > > wrote:
> > >
> > > > Hi Marek,
> > > >
> > > > To be honest I would just take a deep breath and give us some time to
> > > > figure out what is going on with the Titan and work around it.
> > > > Hopefully this won't take too long and we can have a patch out shortly.
> > > >
> > > > All the best
> > > > Ross
> > > >
> > > >
> > > >
> > > > On 6/3/13 11:47 AM, "Marek Maly" <marek.maly.ujep.cz> wrote:
> > > >
> > > >> Thanks Scott!
> > > >>
> > > >> That sounds to me like "Of course you can win the gold treasure, if
> > > >> you survive Russian roulette first ..."
> > > >>
> > > >> It seems that the difference in reliability for scientific
> > > >> calculations between Teslas and "equivalent" stock GTXs is now (with
> > > >> the GK110 chip) clearly bigger. I am curious how the GTX 780 will
> > > >> compare to the Titans.
> > > >>
> > > >> So let's hope that, in the worst case, downclocking the Titans might
> > > >> solve the problem.
> > > >>
> > > >> BTW, what is the working temperature of your K20c? My Titans run
> > > >> below 80°C (at about 60% fan utilization). For the older cards
> > > >> (GTX 680/580 ...) this temperature should be OK, but maybe for the
> > > >> GK110 it is already too high to ensure zero "bit fluctuations".
> > > >>
> > > >> cuFFT is maybe responsible for the crashes, and maybe also for some of
> > > >> the irreproducibility, but the irreproducibility must also have some
> > > >> other source, as the NUCLEOSOME GB test suggests, where perhaps no FFT
> > > >> is involved (just the real-space calculation)?
> > > >>
> > > >> So thanks for the moment, and please let us know when you make some
> > > >> progress.
> > > >>
> > > >>
> > > >> M.
> > > >>
> > > >>
> > > >>
> > > >> On Mon, 03 Jun 2013 20:12:04 +0200, Scott Le Grand
> > > >> <varelse2005.gmail.com> wrote:
> > > >>
> > > >>> Addressing Divi's two points:
> > > >>>
> > > >>> 1. We're trying to find a way to do this...
> > > >>>
> > > >>> 2. I am extremely paranoid and while I would still use the Titans for
> > > >>> development and testing, I would also currently do my publishable runs
> > > >>> on GK104 GPUs or K20s. Given that, if you're comfortable with
> > > >>> nondeterministic execution a la GROMACS, ACEMD, and NAMD, what's going
> > > >>> on here is seemingly no worse. I'm *not* comfortable with that myself,
> > > >>> and I intend to find a fix or workaround like we did a couple of years
> > > >>> ago with GTX4xx and GTX5xx. So your best strategy might just be to
> > > >>> wait a week or two and see what comes of the bug hunt.
> > > >>>
> > > >>> Marek et al., if these GPU tests are failing on the Titans, then by all
> > > >>> means return them without hesitation, but I don't think consumer-level
> > > >>> GPUs are tested with the same level of rigor as Teslas. The upside is
> > > >>> you get 30% better performance for 1/3 the price. The downside is that
> > > >>> IMO you should carefully validate them before using them. What I'm
> > > >>> seeing here looks like single-bit differences in the low-order bits
> > > >>> that cause a tiny fluctuation that ultimately mushrooms and diverges
> > > >>> the whole shebang, along with occasional crashes. The crashes seem to
> > > >>> occur somewhere in cuFFT. I have yet to see divergence there.
> > > >>>
> > > >>> Scott
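
How a difference in the low-order bits can grow until two runs disagree completely can be illustrated with a toy example. The Python sketch below is only an analogy and has nothing to do with AMBER's actual kernels: it first shows that floating-point addition is not associative, so a changed summation order alters the last bits, and then uses the chaotic logistic map as a stand-in for MD dynamics. The function name `logistic_divergence`, the starting value 0.3, and the perturbation 1e-12 are all arbitrary choices made for this illustration.

```python
def logistic_divergence(x0, eps, threshold=0.1, max_steps=200):
    """Iterate x -> 4x(1-x) from x0 and from x0 + eps.

    Return the first step at which the two trajectories differ by more
    than threshold, or None if that never happens within max_steps.
    """
    a, b = x0, x0 + eps
    for step in range(1, max_steps + 1):
        a = 4.0 * a * (1.0 - a)
        b = 4.0 * b * (1.0 - b)
        if abs(a - b) > threshold:
            return step
    return None


# The same three numbers summed in a different order differ in the last
# bits, as can happen when a parallel reduction changes its ordering.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print("sums identical:", left == right, "difference:", abs(left - right))

# A perturbation far below the four printed decimals of an mdout record
# still separates the two trajectories after a modest number of steps.
print("diverged at step:", logistic_divergence(0.3, 1e-12))
```

The point of the toy model is only that in a chaotic system any tiny discrepancy is amplified roughly exponentially, so the step at which two runs visibly part ways says little about how large the original error was.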
> > > >>>
> > > >>>
> > > >>> On Mon, Jun 3, 2013 at 9:42 AM, Marek Maly <marek.maly.ujep.cz> wrote:
> > > >>>
> > > >>>> Hi,
> > > >>>> so here are my NUCLEOSOME test results. All tests finished, although
> > > >>>> TITAN_0/ROUND_2 ended with "****" energies (the "***" records start
> > > >>>> at the 75K step, so I am surprised that the test ran to the end). All
> > > >>>> the results are irreproducible (driver 319.23, Amber12 with bugfix 18
> > > >>>> applied, CUDA 5.5). I will repeat the test with CUDA 5.0.
> > > >>>>
> > > >>>> M.
> > > >>>>
> > > >>>> >>>>>> TITAN_0
> > > >>>>
> > > >>>> ROUND_1
> > > >>>>
> > > >>>> ------------------------------------------------------------------------------
> > > >>>>
> > > >>>>  NSTEP =   100000   TIME(PS) =     300.000  TEMP(K) =   310.60  PRESS =     0.0
> > > >>>>  Etot   =    -66843.8345  EKtot   =     19690.5156  EPtot      =    -86534.3502
> > > >>>>  BOND   =      5887.3611  ANGLE   =     13673.5215  DIHED      =     16941.7678
> > > >>>>  1-4 NB =      5576.6911  1-4 EEL =      1371.5924  VDWAALS    =    -13647.8461
> > > >>>>  EELEC  =    -14410.1252  EGB     =   -102286.9459  RESTRAINT  =       359.6331
> > > >>>>  EAMBER (non-restraint)  =    -86893.9832
> > > >>>>
> > > >>>> ------------------------------------------------------------------------------
> > > >>>>
> > > >>>> ROUND_2
> > > >>>>
> > > >>>> ------------------------------------------------------------------------------
> > > >>>>
> > > >>>>  NSTEP =   100000   TIME(PS) =     300.000  TEMP(K) =*********  PRESS =     0.0
> > > >>>>  Etot   = **************  EKtot   = **************  EPtot      =   4279668.7807
> > > >>>>  BOND   =        -0.0000  ANGLE   =   4681740.3488  DIHED      =     67661.6797
> > > >>>>  1-4 NB =        -0.0000  1-4 EEL =        -2.0373  VDWAALS    =       244.1012
> > > >>>>  EELEC  =     72548.4049  EGB     =   -542523.7166  RESTRAINT  =        -0.0000
> > > >>>>  EAMBER (non-restraint)  =   4279668.7807
> > > >>>>
> > > >>>> ------------------------------------------------------------------------------
> > > >>>> STARS from the 75k step ...
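
Fortran-formatted output fills a field with asterisks when the number does not fit the field width, which is what the "****" entries above are. A run that has blown up can therefore be spotted by scanning the output for the first record containing such a field. The Python sketch below is a hypothetical helper (the name `first_overflow_step` is invented here) and assumes records that start with an `NSTEP = ...` line, as in the output quoted above.

```python
import re


def first_overflow_step(text):
    """Return the NSTEP of the first record that contains a '*'-filled
    field (three or more asterisks after an '='), or None if there is none."""
    step = None
    for line in text.splitlines():
        match = re.search(r"NSTEP\s*=\s*(\d+)", line)
        if match:
            step = int(match.group(1))
        # Only report overflow once we know which record we are in.
        if step is not None and re.search(r"=\s*\*{3,}", line):
            return step
    return None
```

Called on the text of an mdout file, this would return the first printed step with overflowed energies, or `None` for a run whose printed values all fit their fields.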
> > > >>>>
> > > >>>>
> > > >>>> >>>>>> TITAN_1
> > > >>>>
> > > >>>> ROUND_1
> > > >>>>
> > > >>>> ------------------------------------------------------------------------------
> > > >>>>
> > > >>>>  NSTEP =   100000   TIME(PS) =     300.000  TEMP(K) =   310.36  PRESS =     0.0
> > > >>>>  Etot   =    -66846.8801  EKtot   =     19675.0488  EPtot      =    -86521.9289
> > > >>>>  BOND   =      5760.2422  ANGLE   =     13619.8710  DIHED      =     16996.9045
> > > >>>>  1-4 NB =      5645.6416  1-4 EEL =      1774.6967  VDWAALS    =    -13622.9343
> > > >>>>  EELEC  =    -14168.1788  EGB     =   -102880.8089  RESTRAINT  =       352.6371
> > > >>>>  EAMBER (non-restraint)  =    -86874.5660
> > > >>>>
> > > >>>> ------------------------------------------------------------------------------
> > > >>>>
> > > >>>> ROUND_2
> > > >>>>
> > > >>>> ------------------------------------------------------------------------------
> > > >>>>
> > > >>>>  NSTEP =   100000   TIME(PS) =     300.000  TEMP(K) =   311.00  PRESS =     0.0
> > > >>>>  Etot   =    -66874.9016  EKtot   =     19715.3633  EPtot      =    -86590.2649
> > > >>>>  BOND   =      5819.0667  ANGLE   =     13683.6633  DIHED      =     16918.8596
> > > >>>>  1-4 NB =      5627.0932  1-4 EEL =      1576.9564  VDWAALS    =    -13747.1032
> > > >>>>  EELEC  =    -15232.3280  EGB     =   -101590.5078  RESTRAINT  =       354.0348
> > > >>>>  EAMBER (non-restraint)  =    -86944.2997
> > > >>>>
> > > >>>> ------------------------------------------------------------------------------
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On Mon, 03 Jun 2013 12:34:15 +0200, Marek Maly <marek.maly.ujep.cz>
> > > >>>> wrote:
> > > >>>>
> > > >>>> > OK, I will try the NUCLEOSOME case as well with my latest
> > > >>>> > settings (driver 319.23, Amber12 with bugfix 18 applied, CUDA 5.5).
> > > >>>> >
> > > >>>> > M.
> > > >>>> >
> > > >>>> >
> > > >>>> >
> > > >>>> >
> > > >>>> > On Mon, 03 Jun 2013 11:51:46 +0200, ET <sketchfoot.gmail.com> wrote:
> > > >>>> >
> > > >>>> >> Hi all,
> > > >>>> >>
> > > >>>> >> I reran the benchmarks with Amber recompiled, the latest drivers,
> > > >>>> >> and the GPU in solo configuration. The results are as follows.
> > > >>>> >>
> > > >>>> >> When I run the tests on GPU-00_TeaNCake:
> > > >>>> >>
> > > >>>> >> 1) All the tests (across 2x repeats) finish successfully.
> > > >>>> >>
> > > >>>> >> 2) The sdiff logs indicate that reproducibility across the two
> > > >>>> >> repeats is as follows:
> > > >>>> >>
> > > >>>> >> GB_myoglobin: Reproducible across 1,000,000 steps
> > > >>>> >> GB_nucleosome: No reproducibility shown from step 3,400 onwards.
> > > >>>> >> Also outfile is not written properly - blank gaps appear where
> > > >>>> >> something should have been written.
> > > >>>> >> GB_TRPCage: Reproducible across 1,000,000 steps
> > > >>>> >>
> > > >>>> >> PME_JAC_production_NVE: No reproducibility shown from step 35,000
> > > >>>> >> onwards. Also outfile is not written properly - blank gaps appear
> > > >>>> >> where something should have been written.
> > > >>>> >> PME_JAC_production_NPT: No reproducibility shown from step 69,000
> > > >>>> >> onwards. Also outfile is not written properly - blank gaps appear
> > > >>>> >> where something should have been written.
> > > >>>> >> PME_FactorIX_production_NVE: Reproducible across 100k steps
> > > >>>> >> PME_FactorIX_production_NPT: Reproducible across 100k steps
> > > >>>> >> PME_Cellulose_production_NVE: Reproducible across 100k steps
> > > >>>> >> PME_Cellulose_production_NPT: No reproducibility shown from step
> > > >>>> >> 17,000 onwards. Also outfile is not written properly - blank gaps
> > > >>>> >> appear where something should have been written.
> > > >>>> >>
> > > >>>> >> #################################################
> > > >>>> >>
> > > >>>> >>
> > > >>>> >> So it looks like the problem does occur in GB runs too. I notice,
> > > >>>> >> though, that running in single-GPU mode seems to make the problem
> > > >>>> >> appear much later than it does with dual GPUs. Obviously this is
> > > >>>> >> quite qualitative and based on only 1 repeat.
> > > >>>> >>
> > > >>>> >> br,
> > > >>>> >> g
> > > >>>> >>
> > > >>>> >>
> > > >>>> >>
> > > >>>> >>
> > > >>>> >> On 3 June 2013 10:28, ET <sketchfoot.gmail.com> wrote:
> > > >>>> >>
> > > >>>> >>> Hi Marek,
> > > >>>> >>>
> > > >>>> >>> I think what you say about Valley and Heaven is true to a certain
> > > >>>> >>> extent, but I think the links I posted to the EVGA overclock
> > > >>>> >>> utility and MSI Kombustor are very good ways of testing the card.
> > > >>>> >>> I don't know the details of memtestG80 and cuda_memtest, but it
> > > >>>> >>> seems to me that they test one very specific component, i.e. the
> > > >>>> >>> memory. As the graphics card consists of more than this, it is
> > > >>>> >>> better to have a test that checks the card in a more holistic
> > > >>>> >>> manner IMO. :)
> > > >>>> >>>
> > > >>>> >>> I think this argument is supported by the fact that tech support
> > > >>>> >>> at the store used a program called FurMark to stress test the GPU.
> > > >>>> >>> As the GPU I returned kept failing the benchmark, they realized in
> > > >>>> >>> less than half a day that it was faulty, whilst I wasted a couple
> > > >>>> >>> of days mucking about with GPU memory tests using Gpuburn on Linux.
> > > >>>> >>>
> > > >>>> >>> http://www.ozone3d.net/benchmarks/fur/
> > > >>>> >>>
> > > >>>> >>> I think if you are going to test on Windows, you are better off
> > > >>>> >>> getting MSI Kombustor, which I posted earlier. It contains the
> > > >>>> >>> test from FurMark and many additional tests that exercise the
> > > >>>> >>> compute capability of the card.
> > > >>>> >>> best regards,
> > > >>>> >>> g
> > > >>>> >>>
> > > >>>> >> _______________________________________________
> > > >>>> >> AMBER mailing list
> > > >>>> >> AMBER.ambermd.org
> > > >>>> >> http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Jun 04 2013 - 05:00:02 PDT