Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Marek Maly <marek.maly.ujep.cz>
Date: Mon, 03 Jun 2013 21:31:03 +0200

OK,
I just took a deep breath and started to pray :))

BTW, the difference between the GB results for TRPcage/myoglobin (perfectly
reproducible) and Nucleosome (irreproducible) might be connected with some
differences in the mdin parameters:

TRPcage/myoglobin (igb=1, ntt=3) versus Nucleosome (igb=5, ntt=1).
The Nucleosome simulation also uses a restraint:

RESTRAIN DNA
0.1
RES 1 294
END
END

I will experiment here to find out which parameter is responsible for the
irreproducible Nucleosome results.
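
For comparing two repeats while changing one parameter at a time, something
like the following could locate the first step where the runs diverge. It is
only a minimal sketch: the function names (parse_energies, first_divergence)
are made up for illustration and are not part of Amber, and it assumes that
both mdout files print "NSTEP = ..." followed by "Etot = ..." at the same
steps, as in the outputs quoted in this thread. A field of asterisks is
treated as an overflowed value.

```python
import re

# Matches "NSTEP = 100000" or "Etot = -66843.8345"; a run of asterisks is
# what the output shows when a value no longer fits its field.
# "EKtot" and "EPtot" do not match, because "Etot" is not a substring of them.
FIELD = re.compile(r"\b(NSTEP|Etot)\s*=\s*(\*+|-?\d+(?:\.\d+)?)")


def parse_energies(text):
    """Return a list of (nstep, etot); etot is None if printed as stars."""
    records = []
    nstep = None
    for key, value in FIELD.findall(text):
        if key == "NSTEP":
            # Skip the record if the step counter itself overflowed.
            nstep = None if value.startswith("*") else int(value)
        elif nstep is not None:
            etot = None if value.startswith("*") else float(value)
            records.append((nstep, etot))
            nstep = None
    return records


def first_divergence(run_a, run_b):
    """Return the first step at which two runs disagree, or None.

    Values are compared exactly, since a reproducible pair of runs is
    expected to print identical energies at every step.
    """
    for (step_a, etot_a), (step_b, etot_b) in zip(run_a, run_b):
        if step_a != step_b or etot_a != etot_b:
            return step_a
    return None
```

To use it, read the two mdout files into strings, pass each through
parse_energies, and give the two lists to first_divergence. Note that a real
mdout also ends with averages and RMS fluctuation records, which this sketch
does not filter out.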

    M.





On Mon, 03 Jun 2013 21:17:23 +0200, Ross Walker <ross.rosswalker.co.uk>
wrote:

> Hi Marek,
>
> To be honest I would just take a deep breath and give us some time to
> figure out what is going on with the Titan and work around it. Hopefully
> this won't take too long and we can have a patch out shortly.
>
> All the best
> Ross
>
>
>
> On 6/3/13 11:47 AM, "Marek Maly" <marek.maly.ujep.cz> wrote:
>
>> Thanks Scott !
>>
>> It sounds to me like "Of course you can win the gold treasure if you
>> survive Russian roulette first ..."
>>
>> It seems that the difference in reliability for scientific calculations
>> between Teslas and "equivalent" stock GTXs is now (with the GK110 chip)
>> clearly bigger. I am curious how the GTX 780 will compare to the Titans.
>>
>> So let's hope that, in the worst case, downclocking the Titans might solve
>> the problem.
>>
>> BTW, what is the working temperature of your K20c? My Titans run under
>> 80°C (at about 60% fan utilization). For the older cards (GTX 680/580 ...)
>> this temperature should be OK, but maybe for the GK110 it is already too
>> high to ensure zero "bit fluctuations".
>>
>> cuFFT is maybe responsible for the crashes, and maybe also for some of the
>> irreproducibility, but the irreproducibility must also have some other
>> source, as suggested by the NUCLEOSOME GB test, where perhaps no FFT is
>> involved (just the real-space calculation)?
>>
>> So thanks for the moment, and please let us know when you make some
>> progress.
>>
>>
>> M.
>>
>>
>>
>> On Mon, 03 Jun 2013 20:12:04 +0200, Scott Le Grand
>> <varelse2005.gmail.com> wrote:
>>
>>> Addressing Divi's two points:
>>>
>>> 1. We're trying to find a way to do this...
>>>
>>> 2. I am extremely paranoid and while I would still use the Titans for
>>> development and testing, I would also currently do my publishable runs on
>>> GK104 GPUs or K20s. Given that, if you're comfortable with
>>> nondeterministic execution ala GROMACS, ACEMD, and NAMD, what's going on
>>> here is seemingly no worse. I'm *not* comfortable with that myself and I
>>> intend to find a fix or workaround like we did a couple years ago with
>>> GTX4xx and GTX5xx. So your best strategy might just be to wait a week or
>>> two and see what comes of the bug hunt.
>>>
>>> Marek et al., if these GPU tests are failing on the Titans, then by all
>>> means return them without hesitation, but I don't think consumer-level GPUs
>>> are tested with the same level of rigor as Teslas. The upside is you get
>>> 30% better performance for 1/3 the price. The downside is that IMO you
>>> should carefully validate them before using them. What I'm seeing here
>>> looks like single-bit differences in the low-order bits that cause a tiny
>>> fluctuation that ultimately mushrooms and diverges the whole shebang, along
>>> with occasional crashes. The crashes seem to occur in cuFFT somewhere. I
>>> have yet to see divergence there.
>>>
>>> Scott
>>>
>>>
>>> On Mon, Jun 3, 2013 at 9:42 AM, Marek Maly <marek.maly.ujep.cz> wrote:
>>>
>>>> Hi,
>>>> so here are my NUCLEOSOME test results. All tests finished, although
>>>> TITAN_0/ROUND_2 finished with "****" energies (the *** records start from
>>>> the 75K step, so it is a surprise to me that the test ran to the end).
>>>> All the results are irreproducible (driver 319.23, Amber12 bugfix 18
>>>> applied, CUDA 5.5). I will repeat it with CUDA 5.0.
>>>>
>>>> M.
>>>>
>>>> >>>>>> TITAN_0
>>>>
>>>>
>>>> ROUND_1
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>>
>>>> NSTEP = 100000 TIME(PS) = 300.000 TEMP(K) = 310.60 PRESS = 0.0
>>>> Etot = -66843.8345 EKtot = 19690.5156 EPtot = -86534.3502
>>>> BOND = 5887.3611 ANGLE = 13673.5215 DIHED = 16941.7678
>>>> 1-4 NB = 5576.6911 1-4 EEL = 1371.5924 VDWAALS = -13647.8461
>>>> EELEC = -14410.1252 EGB = -102286.9459 RESTRAINT = 359.6331
>>>> EAMBER (non-restraint) = -86893.9832
>>>>
>>>> ------------------------------------------------------------------------------
>>>>
>>>> ROUND_2
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>>
>>>> NSTEP = 100000 TIME(PS) = 300.000 TEMP(K) =********* PRESS = 0.0
>>>> Etot = ************** EKtot = ************** EPtot = 4279668.7807
>>>> BOND = -0.0000 ANGLE = 4681740.3488 DIHED = 67661.6797
>>>> 1-4 NB = -0.0000 1-4 EEL = -2.0373 VDWAALS = 244.1012
>>>> EELEC = 72548.4049 EGB = -542523.7166 RESTRAINT = -0.0000
>>>> EAMBER (non-restraint) = 4279668.7807
>>>>
>>>> ------------------------------------------------------------------------------
>>>> STARS from the 75k step ...
>>>>
>>>>
>>>> >>>>>> TITAN_1
>>>>
>>>>
>>>> ROUND_1
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>>
>>>> NSTEP = 100000 TIME(PS) = 300.000 TEMP(K) = 310.36 PRESS = 0.0
>>>> Etot = -66846.8801 EKtot = 19675.0488 EPtot = -86521.9289
>>>> BOND = 5760.2422 ANGLE = 13619.8710 DIHED = 16996.9045
>>>> 1-4 NB = 5645.6416 1-4 EEL = 1774.6967 VDWAALS = -13622.9343
>>>> EELEC = -14168.1788 EGB = -102880.8089 RESTRAINT = 352.6371
>>>> EAMBER (non-restraint) = -86874.5660
>>>>
>>>> ------------------------------------------------------------------------------
>>>>
>>>> ROUND_2
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>>
>>>> NSTEP = 100000 TIME(PS) = 300.000 TEMP(K) = 311.00 PRESS = 0.0
>>>> Etot = -66874.9016 EKtot = 19715.3633 EPtot = -86590.2649
>>>> BOND = 5819.0667 ANGLE = 13683.6633 DIHED = 16918.8596
>>>> 1-4 NB = 5627.0932 1-4 EEL = 1576.9564 VDWAALS = -13747.1032
>>>> EELEC = -15232.3280 EGB = -101590.5078 RESTRAINT = 354.0348
>>>> EAMBER (non-restraint) = -86944.2997
>>>>
>>>> ------------------------------------------------------------------------------
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, 03 Jun 2013 12:34:15 +0200, Marek Maly <marek.maly.ujep.cz>
>>>> wrote:
>>>>
>>>> > OK, I will try the NUCLEOSOME case as well with my latest
>>>> > settings (driver 319.23, Amber12 bugfix 18 applied, CUDA 5.5).
>>>> >
>>>> > M.
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > On Mon, 03 Jun 2013 11:51:46 +0200, ET <sketchfoot.gmail.com> wrote:
>>>> >
>>>> >> Hi all,
>>>> >>
>>>> >> I reran the benchmark with Amber recompiled, the latest drivers, and
>>>> >> the GPU in solo configuration. The results are as follows:
>>>> >>
>>>> >>
>>>> >> When I run the tests on GPU-00_TeaNCake:
>>>> >>
>>>> >> 1) All the tests (across 2x repeats) finish successfully:
>>>> >>
>>>> >>
>>>> >> 2) The sdiff logs indicate that reproducibility across the two repeats
>>>> >> is as follows:
>>>> >>
>>>> >> GB_myoglobin: Reproducible across 1,000,000 steps
>>>> >> GB_nucleosome: No reproducibility shown from step 3,400 onwards. Also
>>>> >> outfile is not written properly - blank gaps appear where something
>>>> >> should have been written.
>>>> >> GB_TRPCage: Reproducible across 1,000,000 steps
>>>> >>
>>>> >> PME_JAC_production_NVE: No reproducibility shown from step 35,000
>>>> >> onwards. Also outfile is not written properly - blank gaps appear where
>>>> >> something should have been written.
>>>> >> PME_JAC_production_NPT: No reproducibility shown from step 69,000
>>>> >> onwards. Also outfile is not written properly - blank gaps appear where
>>>> >> something should have been written.
>>>> >> PME_FactorIX_production_NVE: Reproducible across 100k steps
>>>> >> PME_FactorIX_production_NPT: Reproducible across 100k steps
>>>> >> PME_Cellulose_production_NVE: Reproducible across 100k steps
>>>> >> PME_Cellulose_production_NPT: No reproducibility shown from step 17,000
>>>> >> onwards. Also outfile is not written properly - blank gaps appear where
>>>> >> something should have been written.
>>>> >>
>>>> >> #################################################
>>>> >>
>>>> >>
>>>> >> So it looks like the problem does occur in GB runs too. Though I notice
>>>> >> that running in single GPU mode seems to make the problem appear much
>>>> >> later than it does with dual GPUs, though obviously this is quite
>>>> >> qualitative and based on only 1 repeat.
>>>> >>
>>>> >> br,
>>>> >> g
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> On 3 June 2013 10:28, ET <sketchfoot.gmail.com> wrote:
>>>> >>
>>>> >>> Hi Marek,
>>>> >>>
>>>> >>> I think what you say about Valley and Heaven is true to a certain
>>>> >>> extent, but I think the links I posted to the EVGA overclock utility &
>>>> >>> MSI Kombustor are very good ways of testing the card. I don't know the
>>>> >>> details of memtestG80 and cuda_memtest, but it seems to me that they are
>>>> >>> testing one very specific component, i.e. the memory. As the graphics
>>>> >>> card consists of more than this, it is better to have a test that checks
>>>> >>> the card in a more holistic manner IMO. :)
>>>> >>>
>>>> >>> I think this argument is supported by the fact that tech support at the
>>>> >>> store used a program called FurMark to stress test the GPU. As the GPU I
>>>> >>> returned kept failing the benchmark, they realized in less than half a
>>>> >>> day that it was faulty, whilst I wasted a couple of days mucking about
>>>> >>> with GPU memory tests using Gpuburn on Linux.
>>>> >>>
>>>> >>> http://www.ozone3d.net/benchmarks/fur/
>>>> >>>
>>>> >>> I think if you are going to test on Windows, you are better off getting
>>>> >>> MSI Kombustor, which I posted earlier. It contains the test from
>>>> >>> FurMark and many additional tests that test the compute capability of
>>>> >>> the card.
>>>> >>>
>>>> >>> best regards,
>>>> >>> g
>>>> >>>
>>>> >> _______________________________________________
>>>> >> AMBER mailing list
>>>> >> AMBER.ambermd.org
>>>> >> http://lists.ambermd.org/mailman/listinfo/amber
>>>> >>
>>>> >> __________ Information from ESET NOD32 Antivirus, database version 8405
>>>> >> (20130603) __________
>>>> >>
>>>> >> This message was checked by ESET NOD32 Antivirus.
>>>> >>
>>>> >> http://www.eset.cz
>>>> >>
>>>> >>
>>>> >>
>>>> >
>>>> >
>>>>
>>>>
>>>> --
>>>> This message was created with Opera's revolutionary e-mail client:
>>>> http://www.opera.com/mail/
>>>>
>>>>
>>
>>
>
>
>


Received on Mon Jun 03 2013 - 13:00:02 PDT