Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Scott Le Grand <varelse2005.gmail.com>
Date: Mon, 3 Jun 2013 19:51:08 -0700

Update: The nucleosome GB irreproducibility is weird. It goes away on my
Titan if I set ntpr to 1 (I was trying to find the energy component that
diverges first). Can you guys try this on your machines? I think this
might be a software (SW) problem...
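
For anyone who wants to repeat the comparison, here is a minimal Python sketch of one way to find the first step, and the energy components, at which two repeat runs disagree. It is an illustration only, not part of pmemd or the AMBER tools: the function names `parse_energy_records` and `first_divergence` are hypothetical, and the regular expression assumes the energy records look like the ones quoted further down in this thread ("NAME = value" pairs, with overflowed fields printed as asterisks).

```python
import math
import re

# "NAME = value" pairs as printed in the energy records; a run of '*' marks
# an overflowed field. The layout is assumed from the records in this thread.
FIELD = re.compile(r"([A-Za-z0-9][A-Za-z0-9()\- ]*?)\s*=\s*(\*+|-?\d+(?:\.\d+)?)")


def parse_energy_records(text):
    """Return {nstep: {field: value}}; overflowed '****' fields become NaN."""
    records = {}
    current = None
    for line in text.splitlines():
        # Assumption: the per-step records end where the averages section
        # begins, so parsing stops at that banner.
        if "A V E R A G E S" in line:
            break
        for name, raw in FIELD.findall(line):
            if name == "NSTEP":
                # A new record starts; skip it if the step itself overflowed.
                current = None if raw.startswith("*") else records.setdefault(int(raw), {})
            elif current is not None:
                current[name] = math.nan if raw.startswith("*") else float(raw)
    return records


def _same(x, y, tol):
    # Two overflowed (NaN) fields count as equal; NaN vs. a number does not.
    if math.isnan(x) or math.isnan(y):
        return math.isnan(x) and math.isnan(y)
    return abs(x - y) <= tol


def first_divergence(run_a, run_b, tol=0.0):
    """Return (nstep, [differing fields]) for the first differing step, else None."""
    for step in sorted(run_a.keys() & run_b.keys()):
        a, b = run_a[step], run_b[step]
        differing = [k for k in a if k in b and not _same(a[k], b[k], tol)]
        if differing:
            return step, differing
    return None
```

With two output files from repeat runs (the names `run1.mdout` and `run2.mdout` are placeholders), something like `first_divergence(parse_energy_records(open("run1.mdout").read()), parse_energy_records(open("run2.mdout").read()))` would report the first printed step that differs. The default `tol=0.0` compares the printed values exactly, which matches the bitwise reproducibility being tested here; the resolution in steps is limited by ntpr.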






On Mon, Jun 3, 2013 at 1:18 PM, ET <sketchfoot.gmail.com> wrote:

> Hi Scott & Ross,
>
> I take it you will post to this thread once a fix has been found? :)
>
> br,
> g
>
>
> On 3 June 2013 20:31, Marek Maly <marek.maly.ujep.cz> wrote:
>
> > OK,
> > I just took a deep breath and started to pray :))
> >
> > BTW, the difference between the GB results for TRPcage/myoglobin
> > (perfectly reproducible) and Nucleosome (irreproducible) might be
> > connected with some differences in the mdin parameters:
> >
> > TRPcage/myoglobin (igb=1, ntt=3) versus Nucleosome (igb=5, ntt=1).
> > The Nucleosome simulation also uses a restraint:
> >
> > RESTRAIN DNA
> > 0.1
> > RES 1 294
> > END
> > END
> >
> > I will experiment here to learn which parameter is responsible for
> > the irreproducible Nucleosome results.
> >
> > M.
> >
> >
> >
> >
> >
> > On Mon, 03 Jun 2013 21:17:23 +0200, Ross Walker <ross.rosswalker.co.uk>
> > wrote:
> >
> > > Hi Marek,
> > >
> > > To be honest I would just take a deep breath and give us some time to
> > > figure out what is going on with the Titan and work around it. Hopefully
> > > this won't take too long and we can have a patch out shortly.
> > >
> > > All the best
> > > Ross
> > >
> > >
> > >
> > > On 6/3/13 11:47 AM, "Marek Maly" <marek.maly.ujep.cz> wrote:
> > >
> > >> Thanks Scott!
> > >>
> > >> It sounds to me like "Of course you can win the gold treasure, if you
> > >> survive Russian roulette first ..."
> > >>
> > >> It seems that the difference in reliability for scientific
> > >> calculations between Teslas and "equivalent" stock GTXs is now (with
> > >> the GK110 chip) clearly bigger. I am curious how the GTX 780 will
> > >> compare to the Titans.
> > >>
> > >> So let's hope that, in the worst case, downclocking the Titans might
> > >> solve the problem.
> > >>
> > >> BTW, what is the working temperature of your K20c? My Titans run below
> > >> 80°C (circa 60% fan utilization). For the older cards (GTX 680/580 ...)
> > >> this temperature should be OK, but maybe for the GK110 it is already
> > >> too high to ensure zero "bit fluctuations".
> > >>
> > >> cuFFT may be responsible for the crashes and maybe also for some of the
> > >> irreproducibility, but the irreproducibility must also have another
> > >> source, as suggested by the NUCLEOSOME GB test, where perhaps no FFT is
> > >> involved (just the real-space calculation)?
> > >>
> > >> So thanks for the moment, and please let us know when you make some
> > >> progress.
> > >>
> > >>
> > >> M.
> > >>
> > >>
> > >>
> > >> On Mon, 03 Jun 2013 20:12:04 +0200, Scott Le Grand
> > >> <varelse2005.gmail.com> wrote:
> > >>
> > >>> Addressing Divi's two points:
> > >>>
> > >>> 1. We're trying to find a way to do this...
> > >>>
> > >>> 2. I am extremely paranoid, and while I would still use the Titans for
> > >>> development and testing, I would also currently do my publishable runs
> > >>> on GK104 GPUs or K20s. Given that, if you're comfortable with
> > >>> nondeterministic execution à la GROMACS, ACEMD, and NAMD, what's going
> > >>> on here is seemingly no worse. I'm *not* comfortable with that myself,
> > >>> and I intend to find a fix or workaround like we did a couple of years
> > >>> ago with GTX4xx and GTX5xx. So your best strategy might just be to wait
> > >>> a week or two and see what comes of the bug hunt.
> > >>>
> > >>> Marek et al., if these GPU tests are failing on the Titans, then by all
> > >>> means return them without hesitation, but I don't think consumer-level
> > >>> GPUs are tested with the same level of rigor as Teslas. The upside is
> > >>> you get 30% better performance for 1/3 the price. The downside is that
> > >>> IMO you should carefully validate them before using them. What I'm
> > >>> seeing here looks like single-bit differences in the low-order bits
> > >>> that cause a tiny fluctuation that ultimately mushrooms and diverges
> > >>> the whole shebang, along with occasional crashes. The crashes seem to
> > >>> occur in cuFFT somewhere. I have yet to see divergence there.
> > >>>
> > >>> Scott
> > >>>
> > >>>
> > >>> On Mon, Jun 3, 2013 at 9:42 AM, Marek Maly <marek.maly.ujep.cz> wrote:
> > >>>
> > >>>> Hi,
> > >>>> so here are my NUCLEOSOME test results. All tests finished, although
> > >>>> TITAN_0/ROUND_2 ended with "****" energies (the "***" records start
> > >>>> from the 75K step, so it is a surprise to me that the test ran to the
> > >>>> end). All the results are irreproducible (driver 319.23, Amber12
> > >>>> bugfix 18 applied, CUDA 5.5). I will repeat it with CUDA 5.0.
> > >>>>
> > >>>> M.
> > >>>>
> > >>>> >>>>>> TITAN_0
> > >>>>
> > >>>> ROUND_1
> > >>>>
> > >>>> ------------------------------------------------------------------------------
> > >>>>
> > >>>>  NSTEP  =   100000   TIME(PS) =     300.000  TEMP(K) =   310.60  PRESS =     0.0
> > >>>>  Etot   =    -66843.8345  EKtot   =     19690.5156  EPtot     =    -86534.3502
> > >>>>  BOND   =      5887.3611  ANGLE   =     13673.5215  DIHED     =     16941.7678
> > >>>>  1-4 NB =      5576.6911  1-4 EEL =      1371.5924  VDWAALS   =    -13647.8461
> > >>>>  EELEC  =    -14410.1252  EGB     =   -102286.9459  RESTRAINT =       359.6331
> > >>>>  EAMBER (non-restraint) =    -86893.9832
> > >>>>
> > >>>> ------------------------------------------------------------------------------
> > >>>>
> > >>>> ROUND_2
> > >>>>
> > >>>> ------------------------------------------------------------------------------
> > >>>>
> > >>>>  NSTEP  =   100000   TIME(PS) =     300.000  TEMP(K) =*********  PRESS =     0.0
> > >>>>  Etot   = **************  EKtot   = **************  EPtot     =   4279668.7807
> > >>>>  BOND   =        -0.0000  ANGLE   =   4681740.3488  DIHED     =     67661.6797
> > >>>>  1-4 NB =        -0.0000  1-4 EEL =        -2.0373  VDWAALS   =       244.1012
> > >>>>  EELEC  =     72548.4049  EGB     =   -542523.7166  RESTRAINT =        -0.0000
> > >>>>  EAMBER (non-restraint) =   4279668.7807
> > >>>>
> > >>>> ------------------------------------------------------------------------------
> > >>>> STARS from the 75k step ...
> > >>>>
> > >>>>
> > >>>> >>>>>> TITAN_1
> > >>>>
> > >>>> ROUND_1
> > >>>>
> > >>>> ------------------------------------------------------------------------------
> > >>>>
> > >>>>  NSTEP  =   100000   TIME(PS) =     300.000  TEMP(K) =   310.36  PRESS =     0.0
> > >>>>  Etot   =    -66846.8801  EKtot   =     19675.0488  EPtot     =    -86521.9289
> > >>>>  BOND   =      5760.2422  ANGLE   =     13619.8710  DIHED     =     16996.9045
> > >>>>  1-4 NB =      5645.6416  1-4 EEL =      1774.6967  VDWAALS   =    -13622.9343
> > >>>>  EELEC  =    -14168.1788  EGB     =   -102880.8089  RESTRAINT =       352.6371
> > >>>>  EAMBER (non-restraint) =    -86874.5660
> > >>>>
> > >>>> ------------------------------------------------------------------------------
> > >>>>
> > >>>> ROUND_2
> > >>>>
> > >>>> ------------------------------------------------------------------------------
> > >>>>
> > >>>>  NSTEP  =   100000   TIME(PS) =     300.000  TEMP(K) =   311.00  PRESS =     0.0
> > >>>>  Etot   =    -66874.9016  EKtot   =     19715.3633  EPtot     =    -86590.2649
> > >>>>  BOND   =      5819.0667  ANGLE   =     13683.6633  DIHED     =     16918.8596
> > >>>>  1-4 NB =      5627.0932  1-4 EEL =      1576.9564  VDWAALS   =    -13747.1032
> > >>>>  EELEC  =    -15232.3280  EGB     =   -101590.5078  RESTRAINT =       354.0348
> > >>>>  EAMBER (non-restraint) =    -86944.2997
> > >>>>
> > >>>> ------------------------------------------------------------------------------
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Mon, 03 Jun 2013 12:34:15 +0200, Marek Maly <marek.maly.ujep.cz>
> > >>>> wrote:
> > >>>>
> > >>>> > OK, I will try NUCLEOSOME case as well with my latest
> > >>>> > settings : (driver 319.23, Amber12 bugfix 18 applied, cuda 5.5)
> > >>>> >
> > >>>> > M.
> > >>>> >
> > >>>> >
> > >>>> >
> > >>>> >
> > >>>> > On Mon, 03 Jun 2013 11:51:46 +0200, ET <sketchfoot.gmail.com> wrote:
> > >>>> >
> > >>>> >> Hi all,
> > >>>> >>
> > >>>> >> I reran the benchmarks with Amber recompiled, the latest drivers,
> > >>>> >> and the GPU in a solo configuration, with the following results.
> > >>>> >>
> > >>>> >> When I run the tests on GPU-00_TeaNCake:
> > >>>> >>
> > >>>> >> 1) All the tests (across 2x repeats) finish successfully.
> > >>>> >>
> > >>>> >> 2) The sdiff logs indicate that reproducibility across the two
> > >>>> >> repeats is as follows:
> > >>>> >>
> > >>>> >> GB_myoglobin: Reproducible across 1,000,000 steps
> > >>>> >> GB_nucleosome: No reproducibility shown from step 3,400 onwards.
> > >>>> >> Also outfile is not written properly - blank gaps appear where
> > >>>> >> something should have been written.
> > >>>> >> GB_TRPCage: Reproducible across 1,000,000 steps
> > >>>> >>
> > >>>> >> PME_JAC_production_NVE: No reproducibility shown from step 35,000
> > >>>> >> onwards. Also outfile is not written properly - blank gaps appear
> > >>>> >> where something should have been written.
> > >>>> >> PME_JAC_production_NPT: No reproducibility shown from step 69,000
> > >>>> >> onwards. Also outfile is not written properly - blank gaps appear
> > >>>> >> where something should have been written.
> > >>>> >> PME_FactorIX_production_NVE: Reproducible across 100k steps
> > >>>> >> PME_FactorIX_production_NPT: Reproducible across 100k steps
> > >>>> >> PME_Cellulose_production_NVE: Reproducible across 100k steps
> > >>>> >> PME_Cellulose_production_NPT: No reproducibility shown from step
> > >>>> >> 17,000 onwards. Also outfile is not written properly - blank gaps
> > >>>> >> appear where something should have been written.
> > >>>> >>
> > >>>> >> #################################################
> > >>>> >>
> > >>>> >>
> > >>>> >> So it looks like the problem does occur in GB runs too. I notice,
> > >>>> >> though, that running in single-GPU mode seems to make the problem
> > >>>> >> appear much later than it does with dual GPUs, although obviously
> > >>>> >> this is quite qualitative and based on only 1 repeat.
> > >>>> >>
> > >>>> >> br,
> > >>>> >> g
> > >>>> >>
> > >>>> >>
> > >>>> >>
> > >>>> >>
> > >>>> >> On 3 June 2013 10:28, ET <sketchfoot.gmail.com> wrote:
> > >>>> >>
> > >>>> >>> Hi Marek,
> > >>>> >>>
> > >>>> >>> I think what you say about Valley and Heaven is true to a certain
> > >>>> >>> extent, but I think the links I posted to the EVGA overclock
> > >>>> >>> utility & MSI Kombustor are very good ways of testing the card. I
> > >>>> >>> don't know the details of memtestG80 and cuda_memtest, but it
> > >>>> >>> seems to me that they test one very specific component, i.e. the
> > >>>> >>> memory. As the graphics card consists of more than this, it is
> > >>>> >>> better to have a test that checks the card in a more holistic
> > >>>> >>> manner IMO. :)
> > >>>> >>>
> > >>>> >>> I think this argument is supported by the fact that tech support
> > >>>> >>> at the store used a program called FurMark to stress test the GPU.
> > >>>> >>> As the GPU I returned kept failing the benchmark, they realized in
> > >>>> >>> less than half a day that it was faulty, whilst I wasted a couple
> > >>>> >>> of days mucking about with GPU memory tests using Gpuburn on
> > >>>> >>> Linux.
> > >>>> >>>
> > >>>> >>> http://www.ozone3d.net/benchmarks/fur/
> > >>>> >>>
> > >>>> >>> I think if you are going to test on Windows, you are better off
> > >>>> >>> getting MSI Kombustor, which I posted earlier. It contains the
> > >>>> >>> test from FurMark and many additional tests that exercise the
> > >>>> >>> compute capability of the card.
> > >>>> >>>
> > >>>> >>> best regards,
> > >>>> >>> g
> > >>>> >>>
> > >>>> >> _______________________________________________
> > >>>> >> AMBER mailing list
> > >>>> >> AMBER.ambermd.org
> > >>>> >> http://lists.ambermd.org/mailman/listinfo/amber
> > >>>> >>
> > >>>> >> __________ Information from ESET NOD32 Antivirus, database version 8405
> > >>>> >> (20130603) __________
> > >>>> >>
> > >>>> >> This message was checked by ESET NOD32 Antivirus.
> > >>>> >>
> > >>>> >> http://www.eset.cz
> > >>>> >>
> > >>>> >>
> > >>>> >>
> > >>>> >
> > >>>> >
> > >>>>
> > >>>>
> > >>>> --
> > >>>> This message was created with Opera's revolutionary mail client:
> > >>>> http://www.opera.com/mail/
> > >>>>
> > >>
> > >>
> > >
> > >
> > >
> >
> >
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Jun 03 2013 - 20:00:03 PDT